Creation of structured data from plain text

ABSTRACT

A method and system for converting plain text into structured data. Parse trees for the plain text are generated based on the grammar of a natural language, and the parse trees are mapped onto instance trees generated based on an application-specific model. The best map is chosen, and the instance tree is passed to an application for execution. The method and system can be used for populating a database and/or for retrieving data from a database based on a query.

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

REFERENCE TO A COMPUTER PROGRAM LISTING APPENDIX

A computer program listing appendix is included in the attached CD-R created on Dec. 12, 2000, labeled “Creation of Structured Data from Plain Text,” and including the following files: CommodityProperty.nml (13 KB), DefaultSeg14Result.xml (2 KB), ElectricalProperty.nml (16 KB), Example.txt, Grammar.txt, INML.xml (5 KB), MeasurementProperty.nml (22 KB), Output.txt (3 KB), PeriodProperty.nml (6 KB), PhysicalProperty.nml (36 KB), ReservedNameProperty.nml (6 KB), Seg14.nml (30 KB), Seg14Phrasing.nml (71 KB), UsageProperty.nml (7 KB), and Utility.nml (6 KB). These files are incorporated by reference herein.

BACKGROUND

A. Technical Field

The present invention relates to creation of structured data from plain text, and more particularly, to creation of structured data from plain text based on attributes or parameters of a web-site's content or products.

B. Background of the Invention

In recent years, the Internet has grown at an explosive pace. More and more information, goods, and services are being offered over the Internet. This increase in the data available over the Internet has made it increasingly important that users be able to search through vast amounts of material to find information that is relevant to their interests and queries.

The search problem can be described at two levels at least: searching across multiple web-sites, and searching within a given site. The first level of search is often addressed by “search engines” such as Google™ or Alta Vista™ or directories such as Yahoo™. The second level, which is specific to the content of a site, is typically handled by combinations of search engines and databases. This approach has not been entirely successful in providing users with efficient access to a site's content.

The problem in searching a website or other information-technology based service is composed of two subproblems: first, indexing or categorizing the corpora (body of material) to be searched (i.e., content synthesis), and second, interpreting a search request and executing it over the corpora (i.e., content retrieval). In general, the corpora to be searched typically consist of unstructured information (text descriptions) of items. For e-commerce web-sites, the corpora may be the catalog of the items available through that web-site. For example, the catalog entry for an item might well be the sentence “aqua cashmere v-neck, available in small, medium, large, and extra large.” Such an entry cannot be retrieved by item type or attribute, since the facts that v-neck is a style of sweater, cashmere a form of wool, and aqua a shade of blue, are unknown to current catalogs or search engines. In order to retrieve the information that this item is available, by item type and/or attribute, this description must be converted into an attributed, categorized description. In this example, such an attributed, categorized description may include properly categorizing the item as a sweater, extracting the various attributes, and tagging their values. An example of such a description is illustrated in Table 1.

TABLE 1

Item     Style   Color  Material  Sizes
Sweater  v-neck  Aqua   Cashmere  S, M, L, XL
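The attributed, categorized description of Table 1 can be pictured as a simple record. The following is only an illustrative sketch: the class name `CatalogItem`, its field names, and the helper `find_items` are assumptions introduced here and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogItem:
    """Hypothetical record mirroring the columns of Table 1."""
    item: str
    style: str
    color: str
    material: str
    sizes: List[str] = field(default_factory=list)


def find_items(items, **criteria):
    """Return the items whose attributes equal every given criterion."""
    return [
        it for it in items
        if all(getattr(it, name) == value for name, value in criteria.items())
    ]


sweater = CatalogItem("Sweater", "v-neck", "Aqua", "Cashmere", ["S", "M", "L", "XL"])
catalog = [sweater]

# Retrieval by item type and attribute becomes possible once the text is structured.
by_material = find_items(catalog, item="Sweater", material="Cashmere")
# Exact attribute matching still does not know that aqua is a shade of blue.
by_blue = find_items(catalog, color="Blue")
```

Note that a flat record of this kind captures only the tagged values; relationships such as "aqua is a shade of blue" would have to come from a separate domain model.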

Current technology permits such representations in databases. Further, for many standard items, numeric codes are assigned to make the job of search and representation easier. One such code is the UN Standard Products and Services Code (UN/SPSC), which assigns a standard 8-digit code to any human product or service.

However, while the taxonomies and the technology to represent the taxonomies may exist, conventional systems are unable to generate the taxonomic and attributed representation for an object from its textual description. This leads to the first of the two problems outlined above: the content synthesis problem. More specifically, that is the problem of how to convert plain text into structured objects suitable for automated search and other computational services.

The second problem is one of retrieving data successfully; once the data has been created and attributed, it must be accessible. E-commerce and parametric content sites are faced with a unique challenge, since they must offer search solutions that expose only those products, contents or services that exactly match a customer's specifications. Today, more than 50% of visitors use search as their preferred method for finding desired goods and services. However, as e-commerce web sites continue to offer their customers unmatched variety, category-based navigation of these sites (“virtual aisles”) has become increasingly complex and inadequate. In particular, many web-sites that offer a large catalog of products are often unable to find products with precise or highly parameterized specifications, and instead require the user to review dozens of products that potentially match these specifications.

A few statistics help to emphasize the importance of good searching ability. An important metric that measures the conversion rate of visitors to e-commerce sites into buyers is the book-to-look ratio. The industry average is that only 27 visitors in 1,000 make a purchase. The biggest contributor to this abysmal ratio is failed search. Forrester Research reports that 92% of all e-commerce searches fail. Major sites report that 80% of customers leave the site after a single failed search. Therefore, improving the search capability on a site directly increases revenue through increased customer acquisition, retention, and sales.

While all web-sites experience some form of these search problems to some extent, the problem is particularly acute for web-sites with a deep and rich variety of content or products. Examples are electronic procurement networks, financial sites, sporting goods stores, grocery sites, clothing sites, electronics, software, and computer sites, among many others. Another class of sites with a deep search problem comprises those carrying highly configurable products, such as travel and automotive sites. Ironically, as a rule of thumb, the more a web-site has to offer, the greater the risk that customers will leave the site because of a failed search.

When a customer physically enters a large department store, she can ask a clerk where she can find what she is looking for. The clerk's “search” is flexible in that he can understand the customer's question almost no matter how it is worded. Moreover, the clerk's “search” is generally accurate since the clerk can often specifically identify a product, or initial set of products, that the customer needs. Searches on web sites need to be equally flexible and accurate. In order for that to happen, a visitor's request must be understood not only in terms of the products, but also in terms of the request's parameters or characteristics. However, conventional information retrieval systems for web-site content have been unable to achieve this.

Some of the conventional methods used to find goods and services on web sites, and some problems with these conventional methods, are outlined below:

1. KEYWORD-BASED SEARCH: In this method, users type a set of words or phrases describing what they want into a text box, typically on the main page of the site. A program on the site then takes each individual word entered (sometimes discarding “noise” words such as prepositions and conjunctions), and searches through all pages and product descriptions to find items containing any combination of the words. This method, when given an English sentence or phrase, returns either far too many results or too few. For example, if a customer requests, “show me men's blue wool sweaters,” the search could be unsuccessful for the following reasons. It would either return only those pages that contain all the words in this request, or return any page that contained any single word in the search. In the former case, no items would be found, though there might be many products with those characteristics for sale. For instance, it is possible that an aqua cashmere cardigan would not be matched, since it contains none of the keywords. In the latter case, a large number of items would be found, most of which would be of no interest to the customer. For example, blue wool slacks may be incorrectly matched, since the description contains the keywords “blue” and “wool.” Some keyword-based searches weight results based on how many keywords are matched.
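The two failure modes described above (match every word, or match any word) can be illustrated with a short sketch. The noise-word list, the three-entry catalog, and the function names are assumptions made for illustration; they are not taken from any particular search product.

```python
# Hypothetical noise words discarded before matching.
NOISE_WORDS = {"show", "me", "a", "an", "the", "in", "of", "for"}


def keywords(text):
    """Lower-case the text, split on whitespace, and drop noise words."""
    words = text.lower().replace(",", " ").split()
    return {w for w in words if w not in NOISE_WORDS}


def match_all(query, descriptions):
    """Return descriptions containing every query keyword."""
    wanted = keywords(query)
    return [d for d in descriptions if wanted <= keywords(d)]


def match_any(query, descriptions):
    """Return descriptions containing at least one query keyword."""
    wanted = keywords(query)
    return [d for d in descriptions if wanted & keywords(d)]


CATALOG = ["aqua cashmere cardigan", "blue wool slacks", "red cotton shirt"]
QUERY = "show me men's blue wool sweaters"

strict_hits = match_all(QUERY, CATALOG)  # too few: the relevant cardigan is missed
loose_hits = match_any(QUERY, CATALOG)   # wrong item: slacks match "blue" and "wool"
```

In this toy catalog the strict variant returns nothing and the loose variant returns only the slacks, so the aqua cashmere cardigan is never surfaced by either strategy.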

Keyword-based approaches are widely used in medical transcription applications, database access, voice-mail control and web search. Virtually all commercial natural-language interface products use this approach. In this approach, certain words are regarded as meaningful, and the remainder as meaningless “glue” words. Thus, for example, in the sentence “show all books written by Squigglesby” the words “show,” “book,” and “written” may be regarded as keywords, the word “by” as a meaningless glue word, and the word “Squigglesby” as an argument. The query would then be formed on the theory that a book author named Squigglesby was being requested.

In such systems, keywords are generally some of the common nouns, verbs, adverbs and adjectives, and arguments are proper nouns and numbers. There are exceptions, however. Prepositions are usually regarded as glue words, but in some circumstances and in some systems are regarded as keywords. Generally, this is due to the human tendency to omit words in sentences, known in the argot as “ellipses.” The sentence “Show all books by Squigglesby” is an example of this, where the verb “written” is excluded. In order to cope with this, some keyword-based systems make “by” a keyword.

There are a few specialized cases of, or variations on, keyword searches. Database approaches are an example of a widely used variant on keyword-based approaches. In these systems, the database developer associates keywords or identifiers with specific database fields (columns in specific tables). Various words, specifically interrogative pronouns and adjectives, some verbs, and some prepositions, have fixed meanings to the database query program. All other words can be available as keywords for a template-based recognition system. In response to a user's sentence, the interface system may match the user's sentence to a template set constructed from the database developer's information about database structure and identifiers, and its built-in interpretation of its hardwired keywords. A Structured Query Language (SQL) statement would then be generated which encodes the meaning of the user's sentence, as interpreted by the interface system.

Another example of a specialization of the keyword-based approach is a catalog-based approach. Catalogs are databases of products and services. A “category” is the name of a table: the attributes of the category are some columns of the table. In this approach, a question is first searched by a category word, and then the remainder of the question is used as keywords to search for matching items within the category. For example, “blue woolen sweater” would first search for “blue” “woolen” and “sweater” as keywords indicating a category, and then (assuming “sweater” succeeded as a category keyword and the others did not), for “blue” and “woolen” as keywords within the sweater category. The difficulty with this approach is that cross-category queries fail, since no individual category is available to match in such cases. Further, parameters that are not present in the product descriptions in the category are not used.
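The two-stage lookup of the catalog-based approach, and the way it fails when no category word is present, can be sketched as follows. The table contents and the name `catalog_search` are hypothetical and serve only to make the example self-contained.

```python
# Hypothetical catalog: each category names a table; each row holds attribute values.
CATALOG = {
    "sweater": [
        {"color": "blue", "material": "woolen"},
        {"color": "aqua", "material": "cashmere"},
    ],
    "slacks": [
        {"color": "blue", "material": "woolen"},
    ],
}


def catalog_search(query):
    """Find a category word first, then match the remaining words within it."""
    words = query.lower().split()
    category = next((w for w in words if w in CATALOG), None)
    if category is None:
        # No word names a category, so the search cannot proceed.
        return None, []
    remaining = [w for w in words if w != category]
    hits = [
        row for row in CATALOG[category]
        if all(w in row.values() for w in remaining)
    ]
    return category, hits


in_category = catalog_search("blue woolen sweater")
cross_category = catalog_search("blue woolen clothing")
```

The second query matches rows in both tables by attribute, yet returns nothing because "clothing" is not the name of any single category.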

Some of the central limitations of keyword-based systems are described below:

Meanings of words are fixed, independent of context. In keyword-based systems, keywords have fixed semantics. This is a distinct departure from the use of normal language by humans. Words in natural language derive their meaning through a combination of “symbol” (the word itself) and “context” (the surrounding text and background knowledge). The most glaring example is prepositions in the presence of ellipses. For instance, “by” can indicate the subject of almost any transitive verb, as well as physical proximity or indicating an object or method to use to accomplish a particular task. Another example of meaning dependent on context is that “green” can refer to a color, a state of freshness or newness, or, disparagingly, to inexperience. A quick glance at any page of any dictionary will show that most words have multiple, and often unrelated, meanings, and context is what disambiguates them. Contrary to this nuanced usage of words, in general, keyword-based approaches choose one single meaning for each word, and apply that meaning consistently in all searches. This problem is fundamentally unfixable in these systems: in order to attach a contextual semantic to a word, strong parsing technology is required and a means must be found of specifying a word in context, sufficient for a program to understand the contextual meaning.

Strongly tied to an application. Since the meanings of words must be fixed so strongly, these systems have the interface strongly tied to (and, in general, inseparable from) the application. There is no toolkit comparable to the popular Graphical User Interface (“GUI”) toolkits to form a keyword-based natural-language interface to an arbitrary application.

Missed meanings attached to glue words, especially prepositions. An assumption behind keyword-based approaches is that glue words carry no meaning or semantic content. Unfortunately, in practice there are very few words whose meanings are always unimportant. The words chosen as glue words are those whose meaning is most context-dependent, and thus their semantic content is largely missed.

High error rates, non-robust. Since meanings are attached to words independent of context, meanings can often be guessed wrong. For example, one vendor in this space, Linguistic Technology Corporation, distributes a product (“EnglishWizard”) that permits database users to ask questions of a database. A demonstration is given with a database of purchasers, employees, sales, and products. In this example database, numbers always refer to the number of employees. This produces a sequence where, when a user asks “who purchased exactly two items,” the answer is “no one.” However, when a user asks how many items a particular individual purchased, the answer is “two.” The reason for the discrepancy could be that EnglishWizard did not really understand the question. Instead, the first user question was mapped to a question about employees since it included a number in it.

2. FREE-FORM KEYWORD SEARCH: This category replaces keywords with previously-asked questions and the “right” answers, and returns the answers to the typed-in question. Examples of such systems are described in detail in U.S. Pat. No. 5,309,359, entitled “Method and Apparatus for Generating and Utilizing Annotations to Facilitate Computer Text Retrieval,” issued on May 3, 1994 to Katz, et al., and U.S. Pat. No. 5,404,295, entitled “Method and Apparatus for Utilizing Annotations to Facilitate Computer Retrieval of Database Material,” issued on Apr. 4, 1995 to Katz, et al. In systems employing free-form keyword searching, questions and answers are stored as sets. The question is typically stored in a canonical form, and a rewrite engine attempts to rewrite the user question into this form. If the user question maps into a pre-determined question for which the answer is known, then the answer is returned by the system. Such an approach is used by http://www.AskJeeves.com for Web searching applications, and for lookups of frequently-asked questions (FAQs).

Such systems have several limitations, including the following:

A relatively small number of questions can be answered: The number of questions that can be answered is linearly proportional to the number of questions stored; thus, this method can only be used when it is acceptable to have a relatively small number of questions that can be answered by the system.

Cannot directly answer a user's question: Since such a system processes a user question in toto, and does not attempt to parse it or extract information from the parts, it cannot be used where the solution to the user question requires the use of a parameter value that can be extracted from the question. In sum, the system can merely point the user at a page where his question can be answered; it cannot directly answer the user question.
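The stored question-and-answer lookup described above, together with its inability to use a parameter embedded in the question, can be sketched as follows. The canonicalization rule used here (lower-casing, dropping filler words, sorting the rest) is a deliberately crude assumption and is not the rewrite engine of the cited patents or of any commercial service; all names are hypothetical.

```python
import re

# Hypothetical filler words removed when reducing a question to canonical form.
FILLER = {"the", "a", "an", "is", "are", "what", "whats", "your",
          "please", "do", "you", "have"}


def canonical(question):
    """Reduce a question to a crude canonical form: sorted non-filler words."""
    words = re.findall(r"[a-z]+", question.lower().replace("'", ""))
    return " ".join(sorted(w for w in words if w not in FILLER))


# One stored question with its known answer, keyed by canonical form.
ANSWERS = {canonical("What is your return policy?"): "See the returns page."}


def answer(question):
    """Return the stored answer if the question rewrites to a stored form."""
    return ANSWERS.get(canonical(question))


rephrased = answer("What's the return policy")
parameterized = answer("What is the return policy for sweaters under $50?")
```

A rephrasing of the stored question is answered, but the question carrying extra parameters ("sweaters", a price limit) matches no stored form, and nothing in the lookup extracts those parameters.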

3. UNDERSTANDING-BASED SEARCHES: Systems incorporating understanding-based searches attempt to understand the actual meaning of a user's request, including social and background information. An example of such a system is Wilensky's UNIX-based Help system, UC. UC had built into it a simple understanding of a user's global goals. Wilensky explained that a consequence of not having such a deep understanding was that the system might offer advice, which literally addressed the user's immediate question in a way that conflicted with the user's global goals. A specific example is that a request for more disk space might result in the removal of all the user's files, an action that met the immediate request, but probably not in a way that the user would find appropriate.

Understanding-based systems are generally confined to conversational partners, help systems, and simple translation programs. In general, it should be noted that the underlying application is quite trivial; in fact, the interface is the application. Various specialized systems have also been built, to parse specific classes of documents. A good example is Junglee's resume-parser. Researchers in this area have now largely abandoned this approach. Indeed, the academic consensus is that full understanding is “AI-complete”: a problem that requires a human's full contextual and social understanding.

There have been multiple previous attempts to use natural language as a tool for controlling search and computer programs. One example of these is Terry Winograd's “Planner” system, which was described in his 1972 doctoral thesis. Winograd developed an abstract domain for his program, called the “Blocks World.” The domain consisted of a set of abstract three-dimensional solids, called “blocks,” and a set of “places” on which the blocks could rest. Various blocks could also rest on top of other blocks. Planner would accept a variety of natural language commands corresponding to the desired states of the system (e.g., “Put the pyramid on top of the small cube”), and would then execute the appropriate actions to achieve the desired state of the system. Winograd's system accepted only a highly stylized form of English, and its natural-language abilities were entirely restricted to the blocks domain. The emphasis in the system was on deducing the appropriate sequence of actions to achieve the desired goal, not on the understanding and parsing of unrestricted English.

A variety of programs emerged in the 1980s to permit English-language queries over databases. EasyAsk offers a representative program. In this system, the organization or schema of the database is used as a framework for the questions to be asked. The tables of the database are regarded as the objects of the application, the columns their attributes, and the vocabulary for each attribute the words within the column. Words that do not appear within the columns, including particularly prepositions, are regarded as “noise” words and discarded in query processing.

Such understanding-based systems have a variety of problems, including the following:

Ignored vital relationships: Database schemas are designed for rapid processing of database queries, not semantic information regarding the databases. Relationships between database tables are indicated by importing indicators from one table into another (called “foreign keys”). Using the relationships in the schema as a framework for questions ignores some vital relationships (since the relationship is not explicitly indicated by key importation).

Lost semantic information: Prepositions and other “noise” words often carry significant semantic information, which is context-dependent. For example, in a database for books, authors, and publishers, the preposition “by” may indicate either a publisher or an author, and may indicate the act of publishing or authoring a book.

In addition to the problems described above with respect to some of the different approaches that currently exist for retrieving data, all of the above approaches share the limitation that the Natural Language (“NL”) interface for each application must be handcrafted; there is no separation between the NL parser and interface, and the application itself. Further, development of the interface often consumes more effort than that devoted to the application itself. None of the currently existing approaches to NL interfaces is portable across applications and platforms. There is no NL toolkit analogous to the Windows API/Java AWT for GUIs, nor a concrete method for mapping constructs in NL to constructs in software programs.

Thus, there exists a need for a system and method for creating structured parametric data from plain text, both for purposes of content synthesis and for purposes of data retrieval. Further, such a system should be portable across applications and platforms. In addition, such a system should be able to support searches on any relevant criteria which may be of interest to a web-site's visitors, and by any arbitrary range of values on any parameter. Further, there exists a need for a system which updates seamlessly, invisibly, and rapidly to accommodate a change, when a website adds or modifies the products it offers.

SUMMARY OF THE INVENTION

The present invention provides a system, method, and an architecture for receiving unstructured text, and converting it to structured data. In one embodiment, this is done by mapping the grammatical parse of a sentence into an instance tree of application domain objects. In addition, the present invention is portable across different application domains.

A system in accordance with the present invention can be used for creating structured data from plain text, to allow for the efficient storing of this structured data in a database. For example, from the free text description of a number of products, the structured data (which could be an extracted object and its attributes) can be used to create individual entries in a product database, and thus create content for an e-commerce website or web market. Alternately, or in addition, such a system can be used for creating structured data from a plain text query, for using this structured data to retrieve relevant data from a database. For example, a user's free text query can be converted to a database query that corresponds to the objects of the database and their attributes. Such a system overcomes the limitations of conventional search engines by accepting free form text, and mapping it accurately into a structured search query.

The present invention recognizes that understanding natural language is neither required nor desired in generating structured data; rather, what is desired is the ability to map natural language onto program structure. Further, there is a natural relationship between the parse of the sentence as expressed in a parse tree and a component tree in a program. Thus, the natural language sentence is understood as instructions to build a component tree. A content engine takes in a natural language sentence and produces a program component tree. The component tree is then further simplified before it is passed to a program for execution.

As mentioned above, a system in accordance with the present invention can be used across various applications. In the various embodiments of the present invention, the meaning of a word is dependent only on the application and the role of the word in the sentence. Thus, the definition of a word is largely the province of the application developer. Briefly, words act as identifiers for components. A word in a sentence serves as an identifier for program objects. As discussed above, many words in English or other natural languages have multiple meanings with the meanings dependent upon context. Similarly, for the present invention, a word may be used as an identifier for multiple objects.

In one embodiment, the present invention transforms an English sentence into a set of software objects that are subsequently passed to the given application for execution. One of the advantages of this approach is the ability to attach a natural language interface to any software application with minimal developer effort. The objects of the application domain are captured, in one embodiment, by using the Natural Markup Language (“NML”). The resulting interface is robust and intuitive, as the user now interacts with an application by entering normal English sentences, which are then executed by the program. In addition, an application enhanced with the present invention significantly augments the functionality available to a user.

When given a plain text sentence in a natural language, a system in accordance with one embodiment of the present invention performs the following steps:

(i) A parsing algorithm applies a formal context-free grammar for the natural language to derive all parses of a given sentence. For purposes of discussion, English is used as an example of the natural language of the plain text. However, it is to be noted that the present invention may be used for any natural language. In one embodiment, all parses of the sentence are derived in the time taken to derive a single parse (e.g., concurrently). Preferably, all parses are stored in a single data structure whose size is dramatically smaller than the number of individual parse trees, often just a constant factor larger than the size taken to store a single parse tree. It is to be noted that, in one embodiment, the correct map of a sentence is only known after all possible parses have been attempted.

(ii) A mapping algorithm then uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest based on the application-specific NML model. In other words, the mapping algorithm maps each parse outputted by the parser into an instance tree of objects. In one embodiment, this is done by generating instance trees, mapping each parse onto an instance tree, pruning the instance trees generated, and then using a best-match algorithm on the pruned trees to select the best match.

(iii) A reduced form of the NML object description instance is created as an instance of a Domain Markup Language (“DML”). This DML is passed to the application program for execution.
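The flow of steps (i) through (iii) can be outlined in a few lines of code. This is a schematic sketch only: the class `InstanceTree`, the function `content_pipeline`, and the toy stand-ins for the parser, mapper, and DML generator are all hypothetical, and the choice of the lowest-cost valid tree as the best match is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class InstanceTree:
    """Hypothetical result of mapping one parse onto the application model."""
    root: str
    valid: bool
    cost: float


def content_pipeline(
    sentence: str,
    parse_all: Callable[[str], List[str]],
    map_parse: Callable[[str], Optional[InstanceTree]],
    to_dml: Callable[[InstanceTree], str],
) -> Optional[str]:
    parses = parse_all(sentence)                    # step (i): derive all parses
    trees = [map_parse(p) for p in parses]          # step (ii): map each parse
    valid = [t for t in trees if t is not None and t.valid]  # prune invalid trees
    if not valid:
        return None
    best = min(valid, key=lambda t: t.cost)         # assumed: lowest cost wins
    return to_dml(best)                             # step (iii): reduce to DML


# Toy stand-ins so the sketch runs end to end.
def toy_parse_all(sentence):
    return ["parse-1", "parse-2", "parse-3"]


TOY_MAPS = {
    "parse-1": InstanceTree("Sweater", True, 2.0),
    "parse-2": InstanceTree("Slacks", False, 1.0),  # maps, but is pruned as invalid
}


def toy_map_parse(parse):
    return TOY_MAPS.get(parse)                      # parse-3 yields no mapping


def toy_to_dml(tree):
    return f"<{tree.root}/>"


result = content_pipeline("blue wool sweater", toy_parse_all, toy_map_parse, toy_to_dml)
```

Passing the three stages in as functions keeps the sketch independent of any particular grammar or application model, which loosely mirrors the separation between the natural-language front end and the application.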

The features and advantages described in this summary and the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration of the architecture of a system in accordance with an embodiment of the present invention.

FIG. 2 is a block diagram of the components of the content engine.

FIG. 3A is an example of a parse tree for “abb” using a first grammar.

FIG. 3B is an example of two different parse trees for “abb” using a second grammar.

FIG. 3C illustrates how various parse trees can be represented as a single parse DAG.

FIG. 4 is a flowchart illustrating the functionality of the content engine.

FIG. 5A illustrates one possible parse tree for the sentence “The boy helped the girl with the suitcase.”

FIG. 5B illustrates another possible parse tree for the sentence “The boy helped the girl with the suitcase.”

FIG. 5C illustrates how the different parse trees for the sentence “The boy helped the girl with the suitcase” can be represented as a single parse DAG.

FIG. 6 is a flowchart illustrating the generation of instance trees by the mapper.

FIG. 7 illustrates the pruning of invalid instance trees after all instance trees have been generated by the mapper.

FIG. 8 illustrates a cost function employed by the mapper to pick the best map from the valid instance trees in accordance with an embodiment of the present invention.

FIG. 9 is a flowchart illustrating DML generation in accordance with one embodiment of the present invention.

The figures depict a preferred embodiment of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

I. System Architecture

FIG. 1 illustrates an overview of the architecture of a system in accordance with one embodiment of the present invention. The system comprises a content engine 110, an online dictionary 120, a domain dictionary 130, a Natural Markup Language (“NML”) module 140, a vertical domain concepts module 150, a custom client specifications module 160, a grammar storage 170, and a client data module 182.

The content engine 110 receives as input plain text, parses it, and maps the parses into instance trees. As can be seen from FIG. 1, in one embodiment of the present invention, the content engine 110 receives input from both the online dictionary 120 (which includes words in a natural language), and a domain dictionary 130 (which includes terms specific to a domain).

In addition, the content engine 110 receives input from the NML module 140, which contains an NML model specific to the application or domain for which the system is being used. The application-specific NML is created, in one embodiment, using a combination of automatic and manual editing from the vertical domain concepts obtained from the vertical domain concepts module 150, and the custom client specifications obtained from the custom client specifications module 160. The present invention is customized to a vertical domain 150 of application by creating an object oriented data model that represents the intended functionality of the site. An example of the vertical domain concepts 150 is a taxonomy such as the United Nations Standard Product & Services Code (UN/SPSC). Another example of the vertical domain concepts 150 is the set of concepts that are pertinent to financial information for a company, such as company name, location, officers, products, competitors, annual sales, revenues, employees, etc. An example of custom client specifications 160 is a collection of concepts similar to the vertical domain concepts 150, but specific to a web-site (i.e., not found on all web-sites that may be in the same domain).

In addition, an input to the content engine 110 is also provided by the grammar storage 170. The grammar storage 170 stores a grammar for a particular language. In one embodiment, the grammar storage 170 stores a full context-free grammar for the English language. An example of such a grammar is included in the computer program listing appendix in file grammar.txt. The grammar shown in grammar.txt has its start symbol as <Paragraph>. The rules indicate that a <Paragraph> is composed of one or more <Sentence> symbols separated by <Terminator>. Similarly, a <Sentence> is composed of a <Clause> and so on. Grammars are discussed in greater detail below.
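As a rough picture of how rules of this kind might be held in memory, the sketch below encodes the two rules mentioned above as a dictionary of productions. The right-recursive encoding of "one or more <Sentence> symbols separated by <Terminator>" and the helper `paragraph_symbols` are assumptions made for illustration; the actual contents and notation of grammar.txt are not reproduced here.

```python
# Hypothetical in-memory form of two rules; each symbol maps to its alternatives.
START_SYMBOL = "<Paragraph>"

GRAMMAR = {
    "<Paragraph>": [
        ["<Sentence>"],                                # a single sentence
        ["<Sentence>", "<Terminator>", "<Paragraph>"], # sentence, terminator, more
    ],
    "<Sentence>": [
        ["<Clause>"],
    ],
}


def paragraph_symbols(n_sentences):
    """Expand <Paragraph> into the symbol sequence for n_sentences sentences."""
    if n_sentences < 1:
        raise ValueError("a paragraph contains at least one sentence")
    symbols = [START_SYMBOL]
    for _ in range(n_sentences - 1):
        # Replace the trailing <Paragraph> with the recursive alternative.
        symbols[-1:] = GRAMMAR["<Paragraph>"][1]
    # Close the recursion with the single-sentence alternative.
    symbols[-1:] = GRAMMAR["<Paragraph>"][0]
    return symbols


two_sentences = paragraph_symbols(2)
```

For two sentences the expansion is a <Sentence>, a <Terminator>, and a second <Sentence>, which matches the "separated by" reading of the rule.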

The content engine 110 also has access to a module containing client data 182. This data is used for client-specific or dynamic vocabulary that does not transfer across client sites or applications. Examples of such vocabulary include trade or brand names (e.g., “Explorer”, “Expedition”, or “Excursion” for Ford sport utility vehicles, or the names of confections made by Hershey Foods Company).

FIG. 2 illustrates the architecture of the content engine 110 in an embodiment of the present invention. As can be seen from FIG. 2, the content engine 110 comprises a parser 210, a mapper 220, and a Domain Markup Language (“DML”) generator 230.

The parser 210 parses the text input by the user into all possible parses, based on the grammar stored in the grammar storage 170. In one embodiment, the parser 210 applies a formal context-free grammar for the language in which the user is working, to derive all parses of a given sentence. In one embodiment, all parses are derived in the time taken to derive a single parse. In a preferred embodiment, all of the parses are stored in a single data structure of size equivalent to that taken to store a single parse tree. The parser 210 may generate meaningless parses, but this is acceptable because, as will be discussed below, these meaningless parses will not yield valid mappings into the NML and will be automatically discarded from consideration during the mapping process. The functionality of the parser 210 is discussed in greater detail below.

The mapper 220 accesses all the parses of the text input by the user produced by the parser 210. The mapper 220, in turn, uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest, based on the application-specific NML model provided by the NML module 140. In other words, the mapper 220 maps each parse output by the parser 210 into an instance tree of objects. The functionality of the mapper 220 is discussed in detail below.

In one embodiment, the result of the mapper 220 is not the final result of the content engine 110. One more step remains: the DML generator 230 reduces the structure produced by the mapper 220 to a simpler form. The generation of the DML is directed, in one embodiment, by DML_ELEMENT declarations contained in the NML model provided by the NML module 140. The result of this process, described in detail below, is a document in the Domain Markup Language (“DML”). The DML description can then be passed as an input to the underlying application (not shown in the figures). In one embodiment, the application takes the DML input and uses it to populate a database, using each instance tree as the description of an entity (and its attributes) in the application domain, and creating the appropriate entries in the database. In another embodiment, the application takes the DML input and uses it as a query on an underlying database, to retrieve entries (e.g., products) that satisfy the query, and hence match the user's interests (to the extent that such interest is well expressed in the original text input).

II. System Functionality

A. Background Information

Before discussing the functionality of an embodiment of a system in accordance with the present invention, it will be helpful to discuss what a grammar is, what NML is, and what DML is.

1. Grammar

Languages, both natural and computer, are described by means of a “grammar.” A grammar is a series of mathematical objects called “productions,” which describe mathematically the well-formed “sentences” of the grammar.

A simple example of a grammar, “Grammar1”, is as follows:

    S → AB
    A → aA
    A → a
    B → bB
    B → b

The symbols “S”, “A”, and “B” are called “non-terminals” or “phrases.” They represent purely abstract objects, which do not appear in any sentence in the language, but represent a group of symbols of a language sentence. The symbols “a” and “b” represent words in the language, and are called “terminals” or “words.” By convention, every grammar has a phrase “S” for “sentence”, which appears alone on the left-hand side of one production. A production is applied by replacing the left-hand side of the production with the right-hand side in a string.

A sequence α of terminals is said to be derived from a sequence γ of non-terminals and terminals if γ can be transformed into α by applying a succession of productions of the grammar. For example, for Grammar1, “aabb” can be derived from “aAbB” because the rules A → a and B → b, applied to aAbB, yield aabb. A sequence of terminals, or a “sentence,” is said to be in the language of the grammar if it can be derived from the start symbol, S. For example, for Grammar1, the sequence “abb” is in the language of the grammar, because S → AB → aB → abB → abb. Conversely, “abab” is not in the language, since no succession of productions can be used to derive “abab” from S.
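The notion of derivation described above can be sketched in a few lines of Python. This is only an illustrative search over leftmost derivations, not the parser 210; the names `GRAMMAR1` and `derives` are hypothetical, and the sketch assumes single-character symbols and productions whose right-hand sides are never empty (true of Grammar1), which is what allows over-long forms to be pruned.

```python
from collections import deque

# Grammar1 from the text: each non-terminal maps to its right-hand sides.
GRAMMAR1 = {
    "S": ["AB"],
    "A": ["aA", "a"],
    "B": ["bB", "b"],
}

def derives(grammar, start, target):
    """Return True if `target` can be derived from the form `start`.

    Breadth-first search over leftmost derivations. Because no production
    has an empty right-hand side, a form never shrinks, so any form longer
    than the target can be discarded.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for i, symbol in enumerate(form):
            if symbol in grammar:                 # leftmost non-terminal
                for rhs in grammar[symbol]:
                    new_form = form[:i] + rhs + form[i + 1:]
                    if len(new_form) <= len(target) and new_form not in seen:
                        seen.add(new_form)
                        queue.append(new_form)
                break                             # expand only the leftmost one
    return False

print(derives(GRAMMAR1, "aAbB", "aabb"))  # True
print(derives(GRAMMAR1, "S", "abb"))      # True
print(derives(GRAMMAR1, "S", "abab"))     # False
```

Restricting the search to leftmost derivations loses nothing for a context-free grammar, since every derivable sentence also has a leftmost derivation.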

In English and other natural languages, the non-terminals and terminals correspond intuitively to the standard grammatical objects learned by a school child. The terminals are simply the words and punctuation symbols of the language; the non-terminals are the standard phrase constructs and word types learned in elementary school: noun, verb, noun phrase, verb phrase, etc. The set of non-terminals in human languages tends to be fairly limited; the set of terminals and the productions vary widely, and in their variance is the rich diversity of human language. In general, any sequence of non-terminals and terminals may appear on either side of a grammar rule. However, grammars which exploit this freedom are computationally intractable. Thus, various restrictions are often placed on the form of the left-hand side of the productions, which make parsing these restricted grammars computationally tractable.

Of particular interest are “context-free grammars,” which are distinguished in that the left-hand side of each production is restricted to be a single non-terminal. Grammar1 given above is context-free. In fact, the language it generates (one or more a's followed by one or more b's) is of a slightly more restricted type: “regular”.

As will be explained in more detail below, the context-free grammar used in one embodiment by the content engine 110 provides the minimal amount of grammatical information necessary to capture the correct parse of any grammatically correct English sentence. The main intent of the grammar is to capture the correct parse of a sentence without attempting to understand the meaning (or semantics) of the sentence. The grammar is thus created to include every correct parse of every sentence in the language. Naturally, for any single sentence this results in several ambiguous parses, only one of which is the (semantically) correct parse of the given sentence.

One skilled in the art will note that the grammar provided by the grammar storage 170, in one embodiment, can be substantially compacted from a full grammar of the English language, so as to facilitate brevity of the grammar. For example, the grammar shown in grammar.txt comprehensively ignores grammatical features like verb conjugations, plurality of nouns, tense, active or passive voice, etc. This is acceptable because these features are irrelevant to the parse of a sentence and are only needed if the semantics of a sentence were to be analyzed in detail.

In grammatical analysis, the particular sequence of rewrite rules used to derive the sentence is usually called the parse of the sentence. In a context-free grammar, the parse of a particular sentence can be represented mathematically as a “parse tree.”

FIG. 3A depicts an example of a parse tree for “abb”, using Grammar1 above. For an arbitrary grammar, a parse may not be unique. For example, consider now Grammar2:

    S → AB
    S → CB
    C → aB
    A → aA
    A → a
    B → bB
    B → b

Based on Grammar2, the string “abb” would have two distinct parses, as depicted by the two separate parse trees shown in FIG. 3B.
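The two parses of “abb” can be reproduced with a small exhaustive enumerator. The sketch below is a naive illustration, not the parser 210: the names `GRAMMAR2`, `parses`, and `expand` are hypothetical, trees are represented as nested tuples, and the code assumes single-character symbols and no empty productions so that every symbol covers at least one character.

```python
GRAMMAR2 = {
    "S": ("AB", "CB"),
    "C": ("aB",),
    "A": ("aA", "a"),
    "B": ("bB", "b"),
}

def parses(symbol, text, grammar=GRAMMAR2):
    """Return every parse tree of `text` rooted at `symbol` (nested tuples)."""
    if symbol not in grammar:                      # terminal symbol
        return [symbol] if text == symbol else []
    trees = []
    for rhs in grammar[symbol]:
        for children in expand(rhs, text, grammar):
            trees.append((symbol,) + children)
    return trees

def expand(rhs, text, grammar):
    """Yield tuples of subtrees matching the symbols of `rhs` against `text`."""
    if not rhs:
        if not text:
            yield ()
        return
    first, rest = rhs[0], rhs[1:]
    # Leave at least one character for each remaining symbol.
    for cut in range(1, len(text) - len(rest) + 1):
        for head in parses(first, text[:cut], grammar):
            for tail in expand(rest, text[cut:], grammar):
                yield (head,) + tail

for tree in parses("S", "abb"):
    print(tree)
# ('S', ('A', 'a'), ('B', 'b', ('B', 'b')))
# ('S', ('C', 'a', ('B', 'b')), ('B', 'b'))
```

The first tree uses S → AB and the second uses S → CB, which is the ambiguity the text describes.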

Such a grammar, which can result in multiple parse trees for a string, is said to be “ambiguous.” Most grammars for human languages are ambiguous in this precise technical sense, for the excellent reason that human language is itself ambiguous. For instance, in the sentence “The boy helped the girl with the suitcase,” the modifier “with the suitcase” can either apply to the girl, or to the act of helping. In general, a modifier can modify any part of the sentence. Resolution of ambiguities is an important problem in parsing, and will be discussed below.

Referring again to FIG. 3B, it can be noted that conventionally, different parses result in different parse trees. However, in accordance with an embodiment of the present invention, all parses of a given sentence can be represented as a single parse Directed Acyclic Graph (“DAG”) 300. This is illustrated in FIG. 3C for the sentence “abb”.

The dashed edges 310 of DAG 300 represent optional parses; selecting a set of these edges yields a valid parse tree. By examining FIGS. 3B and 3C, it can be seen that the two trees in FIG. 3B have a total of 14 nodes and 12 edges; in contrast, the parse DAG shown in FIG. 3C has a total of only nine nodes and 11 edges. The space and time savings represented by using the parse DAG are dramatic when there are hundreds or thousands of parses, as is typical for English sentences. The space and time taken to construct the parse DAG is proportional to the number of distinct nodes in the component parse trees, whereas the space and time taken by conventional algorithms is proportional to the total number of nodes of the parse trees.
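The sharing that a parse DAG exploits can be illustrated with a short memoized sketch. It is not the algorithm of the parser 210; it assumes, for illustration only, that two tree nodes are "the same" when they carry the same symbol over the same span of the input, and that the grammar has single-character symbols and no empty productions. The names `build_parse_dag` and `alternatives` are hypothetical.

```python
from itertools import combinations

GRAMMAR2 = {
    "S": ("AB", "CB"),
    "C": ("aB",),
    "A": ("aA", "a"),
    "B": ("bB", "b"),
}

def build_parse_dag(grammar, start, text):
    """Return (nodes, edges) of a shared parse DAG for `text`.

    A node is (symbol, begin, end); each distinct node is built once and
    shared by every parse that uses it.
    """
    alternatives = {}   # node -> list of child tuples, one per derivation

    def derive(symbol, i, j):
        node = (symbol, i, j)
        if node not in alternatives:
            if symbol not in grammar:                      # terminal symbol
                found = [()] if text[i:j] == symbol else []
            else:
                found = []
                for rhs in grammar[symbol]:
                    # Choose the interior cut points between the rhs symbols.
                    for inner in combinations(range(i + 1, j), len(rhs) - 1):
                        cuts = (i,) + inner + (j,)
                        children = tuple(zip(rhs, cuts, cuts[1:]))
                        if all(derive(*child) for child in children):
                            found.append(children)
            alternatives[node] = found
        return bool(alternatives[node])

    nodes, edges = set(), set()
    root = (start, 0, len(text))
    if derive(*root):
        stack = [root]                 # keep only what is reachable from the root
        while stack:
            node = stack.pop()
            if node in nodes:
                continue
            nodes.add(node)
            for children in alternatives[node]:
                for child in children:
                    edges.add((node, child))
                    stack.append(child)
    return nodes, edges

nodes, edges = build_parse_dag(GRAMMAR2, "S", "abb")
print(len(nodes), len(edges))  # 9 11
```

Under these assumptions the sketch yields nine nodes and 11 edges for “abb”, the same counts the text gives for FIG. 3C, against 14 nodes and 12 edges for the two separate trees.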

2. Natural Markup Language (“NML”)

The approach of the present invention is based on describing the set of concepts of a specific application area or domain as a set of objects. Objects are grouped into two fundamental classes:

(i) Enumerations: These are objects defined by single words or fixed phrases in English over the given domain. A simple example of an Enumeration is the object Color, which is defined by the color words (e.g., red, blue, mauve) of everyday experience.

(ii) Composites: These are objects defined as collections of sub-objects. The sub-objects of a composite are called its “attributes.” One example of a composite is the object Desk, which can have attributes PrimitiveDeskWord (e.g., the enumerated object consisting of the word desk and its synonyms), PedestalType (e.g., a composite describing whether this desk has a right, left, or double pedestal), Dimension (e.g., a composite giving the height, width, and depth of the desk), Use (e.g., an enumeration consisting of executive, computer, student, secretary), and various other attributes describing the material, finish, and optional features of the desk.

NML is a language for declaring objects, enumerations, and the relations between objects. In one embodiment, the NML programmer declares the composites and enumerations of the domain. In one embodiment, NML is based on the Extensible Markup Language (“XML”) standard. It should be noted that the NML description of a domain describes a graph of objects, where the sinks of the graph (the nodes with no outgoing edges) are the Enumerations of the domain.

As discussed above with reference to FIG. 1, the NML module 140 provides an application-specific NML to the content engine 110. NML is a tool for describing, to the content engine 110, an application's object hierarchy and the vocabulary by which the hierarchy is referenced in natural language. Because the meanings of words themselves are not relevant to the actual implementation of a system, the present invention can be used for various different applications. An NML document may be created for each application, and, typically, a small special-purpose markup language for the domain itself may be created. The markup language and the NML document are strongly related. An NML document captures the concepts of an application domain, and the markup language is designed to hold the values for those concepts for a particular query.

An example of such a markup language document (from the “CompanyProfileAPI” Markup Language) is shown below, corresponding to the values for the query “Who is the director of human resources for Microsoft in the United Kingdom?”

<COMPANY_PROFILE_API>
  <API_COMPANY_PERSON>
    <PERSON_FULL_NAME GET_OPERATOR="value"/>
    <COMPANY_NAME SET_VALUE="microsoft"/>
    <LOCATION>
      <COUNTRY SET_VALUE="uk"/>
    </LOCATION>
    <PERSON_TITLE SET_VALUE="boss"/>
    <DIVISION SET_VALUE="human resource"/>
  </API_COMPANY_PERSON>
</COMPANY_PROFILE_API>

In this example, it will be seen that the morphology and, in some cases, the actual words of the query have been eliminated; rather, the concepts and values have been inserted in the document, along with whether the user query requested or set each specific value. In this case, the person's full name was requested, and the identifying information given was the company he worked for, the country he worked in, his conceptual title (“boss”), and his division (“human resources”). This is sufficient information to run a query to satisfy the user's request, but all knowledge of the actual English he used in stating his query (and all requirements to parse it) has been eliminated.

As mentioned above, in one embodiment of the present invention, NML is an extension of the eXtensible Markup Language (XML). Briefly, XML is the core of all tag-based markup languages. It is almost never used standalone, but is configured into an application-specific tag-based markup language. Examples of such languages are the Mathematical Markup Language (MathML) and Commerce One's product interchange language.

An XML document consists of a set of “elements.” An element is a chunk of a document contained between an HTML-style opening tag and its matching closing tag. Unlike HTML, however, XML has no built-in tags; rather, the set of tags for a specific document is defined by its Document Type Definition, or DTD. The distinction between two separate XML extension languages is, simply, their DTDs.

Let us introduce NML with a “Hello, world” program. Unlike most programming languages, however, NML isn't good for printing “hello, world”; rather, it's good for recognizing “hello, world”. The program which recognizes “hello, world” appears below in Program1.

<?xml version="1.0"?>
<!DOCTYPE NML_MODEL >
<NML_MODEL DOMAIN="HelloWorld1">
  <COMMENT>
    This file shows the simplest Hello, World example
  </COMMENT>
  <ENUMERATION NAME="HelloWorld">
    <IDENTIFIER LITERAL="Hello, World"/>
  </ENUMERATION>
</NML_MODEL>

Program1 above is extremely simple; it just recognizes an object indexed by the string “hello, world”, and maps it to the object “HelloWorld.” The IDENTIFIER element within the ENUMERATION element indicates that the LITERAL argument, when it occurs in the text, creates an instance of the relevant ENUMERATION. Thus, the phrase “hello, world” creates an instance of the HelloWorld object, and this maps that exact phrase. This program, while simple, recognizes only the exact phrase “hello, world” with various capitalizations. A conventional program which recognized only this exact phrase would have served as well, and been far simpler to write. However, in NML, a program which recognizes much more is almost as easy to write. This is shown in Program2 below.
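As a rough illustration of what “indexed by the string” means, the following Python sketch loads a Program1-style model with the standard-library XML parser and builds a lookup table from each LITERAL to the name of its ENUMERATION. It is not the content engine 110; the function names `literal_index`, `normalize`, and `recognize` are hypothetical, and the case-folding and whitespace normalization are assumptions standing in for “various capitalizations”.

```python
import xml.etree.ElementTree as ET

PROGRAM1 = """<NML_MODEL DOMAIN="HelloWorld1">
  <ENUMERATION NAME="HelloWorld">
    <IDENTIFIER LITERAL="Hello, World"/>
  </ENUMERATION>
</NML_MODEL>"""

def normalize(phrase):
    """Fold case and collapse whitespace (an assumed normalization)."""
    return " ".join(phrase.lower().split())

def literal_index(nml_text):
    """Map each normalized LITERAL to the NAME of its ENUMERATION."""
    root = ET.fromstring(nml_text)
    index = {}
    for enumeration in root.iter("ENUMERATION"):
        for identifier in enumeration.iter("IDENTIFIER"):
            index[normalize(identifier.get("LITERAL"))] = enumeration.get("NAME")
    return index

def recognize(index, text):
    """Return the ENUMERATION name for `text`, or None if it is not indexed."""
    return index.get(normalize(text))

index = literal_index(PROGRAM1)
print(recognize(index, "hello, world"))   # HelloWorld
print(recognize(index, "HELLO, World"))   # HelloWorld
print(recognize(index, "hello there"))    # None
```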

<?xml version="1.0"?>
<!DOCTYPE NML_MODEL >
<NML_MODEL DOMAIN="HelloWorld2">
  <COMMENT>
    This file shows a non-working Hello, World example
  </COMMENT>
  <OBJECT NAME="HelloWorld">
    <ATTRIBUTE MIN="1" MAX="1" INFER="false" ID="Greeting"/>
    <ATTRIBUTE MIN="1" MAX="1" INFER="false" ID="World"/>
  </OBJECT>
  <ENUMERATION NAME="Greeting">
    <IDENTIFIER LITERAL="hello"/>
    <IDENTIFIER LITERAL="hi"/>
    <IDENTIFIER LITERAL="greeting"/>
    <IDENTIFIER LITERAL="good morning"/>
    <IDENTIFIER LITERAL="good afternoon"/>
  </ENUMERATION>
  <ENUMERATION NAME="World">
    <IDENTIFIER LITERAL="world"/>
    <IDENTIFIER LITERAL="everyone"/>
    <IDENTIFIER LITERAL="everybody"/>
  </ENUMERATION>
</NML_MODEL>

Program2 above declares an object HelloWorld with two sub-objects, or ATTRIBUTES: Greeting and World. Greeting is indexed by the literals “hello”, “hi”, “greeting”, “good morning”, and “good afternoon”; World by “everyone”, “everybody”, and “world”. The MIN=1 argument to both ATTRIBUTES indicates that any object of type HelloWorld must have both a Greeting and a World ATTRIBUTE. The sentence “Hello”, for example, will not match, because the World ATTRIBUTE would be missing. Similarly, MAX=1 indicates that only one ATTRIBUTE of each type can be present: “Hello everyone good afternoon” would be unmapped, since two Greeting objects would be created to be sub-objects of HelloWorld.

Program2, when implemented by the content engine 110, is designed to recognize the following phrases.

Hello, world       Hi, world       Good morning, world       Good afternoon, world
Hello, everyone    Hi, everyone    Good morning, everyone    Good afternoon, everyone
Hello, everybody   Hi, everybody   Good morning, everybody   Good afternoon, everybody

However, Program2 does not quite work to recognize these phrases. In fact, Program2 recognizes nothing. Rather, Program3 below, which differs from Program2 by a single word, does in fact recognize the above phrases.

<NML_MODEL DOMAIN="HelloWorld2">
  <COMMENT>
    This file shows a working Hello, World example
  </COMMENT>
  <OBJECT NAME="HelloWorld">
    <ATTRIBUTE MIN="1" MAX="1" INFER="false" ID="Greeting"/>
    <ATTRIBUTE MIN="1" MAX="1" INFER="true" ID="World"/>
  </OBJECT>
  <ENUMERATION NAME="Greeting">
    <IDENTIFIER LITERAL="hello"/>
    <IDENTIFIER LITERAL="hi"/>
    <IDENTIFIER LITERAL="greeting"/>
    <IDENTIFIER LITERAL="good morning"/>
    <IDENTIFIER LITERAL="good afternoon"/>
  </ENUMERATION>
  <ENUMERATION NAME="World">
    <IDENTIFIER LITERAL="world"/>
    <IDENTIFIER LITERAL="everyone"/>
    <IDENTIFIER LITERAL="everybody"/>
  </ENUMERATION>
</NML_MODEL>

As can be seen from examining Program2 and Program3, the change is in the World ATTRIBUTE of the HelloWorld OBJECT: in Program3, the INFER argument is set to true. Inference is when the presence of a modifier can imply the existence of an object, even when the object is not explicitly identified. Here this means that whenever a World OBJECT is created, a HelloWorld OBJECT will be created containing it. This is the second of the two methods by which OBJECTs are created: the first, which has already been described, is when an IDENTIFIER is encountered. In Program2, Greeting and World objects were created, but no HelloWorld object; in fact, in that program, no HelloWorld object could be created, since it had no IDENTIFIERS, nor was it INFERred from any ATTRIBUTE.

The difference in behavior between Program2 and Program3 is due to one other factor: all nouns and verbs in a sentence must be matched in a tree rooted in a single object, or the sentence as a whole is not considered mapped. In Program2 no such root object can be built, so no sentence is mapped.
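The interplay of IDENTIFIER, INFER, MIN, and MAX in Program2 and Program3 can be mimicked with a toy Python model. This is a deliberately simplified sketch, not the mapper 220: it ignores the grammar entirely, treats every word (rather than only nouns and verbs) as something that must be matched, and uses a greedy longest-match over literals of at most two words. The names `find_enumerations` and `maps_to_hello_world` are hypothetical.

```python
import re

# Literals copied from the ENUMERATION declarations of Program2 and Program3.
ENUMERATIONS = {
    "Greeting": {"hello", "hi", "greeting", "good morning", "good afternoon"},
    "World": {"world", "everyone", "everybody"},
}

# (attribute ID, MIN, MAX, INFER) for the HelloWorld OBJECT in each program.
PROGRAM2 = [("Greeting", 1, 1, False), ("World", 1, 1, False)]
PROGRAM3 = [("Greeting", 1, 1, False), ("World", 1, 1, True)]

def find_enumerations(text):
    """Return the enumeration instances created by IDENTIFIERs in `text`,
    or None if some word matches no literal (toy simplification)."""
    words = re.findall(r"[a-z]+", text.lower())
    found, i = [], 0
    while i < len(words):
        for width in (2, 1):                       # prefer two-word literals
            chunk = words[i:i + width]
            if len(chunk) < width:
                continue
            phrase = " ".join(chunk)
            name = next((n for n, lits in ENUMERATIONS.items() if phrase in lits), None)
            if name is not None:
                found.append(name)
                i += width
                break
        else:
            return None                            # unmatched word
    return found

def maps_to_hello_world(text, attributes):
    """True if `text` yields a single HelloWorld object under `attributes`."""
    instances = find_enumerations(text)
    if instances is None:
        return False
    # HelloWorld has no IDENTIFIER, so it can only be created by inference.
    if not any(infer and attr in instances for attr, _, _, infer in attributes):
        return False
    # Discard the object if any attribute violates its MIN/MAX cardinality.
    return all(lo <= instances.count(attr) <= hi for attr, lo, hi, _ in attributes)

print(maps_to_hello_world("Hello, world", PROGRAM2))                   # False
print(maps_to_hello_world("Hello, world", PROGRAM3))                   # True
print(maps_to_hello_world("Good morning, everybody", PROGRAM3))        # True
print(maps_to_hello_world("Hello everyone good afternoon", PROGRAM3))  # False
```

With INFER false on both attributes nothing ever creates the root object, which is why the Program2 configuration rejects every phrase in this toy model as well.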

As mentioned above, NML is the means by which the application developer describes the structure of his application to the content engine 110. In many ways, it is equivalent to defining an Application Programming Interface (API) for the application, with a key property, in one embodiment, that the “application programmer” in this case is a user speaking a specific language (e.g., English). Thus, the API is very simple: it encapsulates only those objects and attributes which a user can create with a single English sentence and which would be expected to be known by users of the application. For example, in a furniture catalog, the NML would describe objects such as Desk, which can have attributes such as PrimitiveDeskWord (e.g., the enumerated object consisting of the word desk and its synonyms), and PedestalType (e.g., a composite describing whether this desk has a right, left, or double pedestal).

In one embodiment, an NML file thus looks similar to a Java interface file or a C++ .h file: it is a description of the objects of an application, without their implementation. The object hierarchy described in the NML file is in logical structure and function very much the programmer's object hierarchy for the application: a few additional objects are typically added to provide targets for English mapping. This section concerns itself with the raw structure of NML: the means by which this is deployed in an application will be seen below.

The easiest way to look at NML is to start with its document type definition (DTD) given below.

<!DOCTYPE NML_MODEL [
 <!ELEMENT NML_MODEL (COMMENT?,IMPORT*,(OBJECT|ENUMERATION|CALLBACK|PATTERN|COMMENT|DML_CALL)*)>
 <!ATTLIST NML_MODEL
  DOMAIN CDATA #REQUIRED
  GENERATE_PEER (true | false | TRUE | FALSE | True | False) "true">
 <!ELEMENT IMPORT EMPTY>
 <!ATTLIST IMPORT
  FILE CDATA #REQUIRED>
 <!ELEMENT OBJECT (COMMENT?,ATTRIBUTE*)>
 <!ATTLIST OBJECT
  NAME CDATA #REQUIRED
  EXPR (true | false | TRUE | FALSE | True | False) "true"
  SINGLETON (true | false | TRUE | FALSE | True | False) "false"
  ROOT (true | false | TRUE | FALSE | True | False) "false"
  DML_ELEMENT CDATA #IMPLIED
  DML_ATTRIBUTE CDATA #IMPLIED
  DML_VALUE CDATA #IMPLIED
  PEER (true | false | TRUE | FALSE | True | False) "true">
 <!ELEMENT ENUMERATION (COMMENT?,IDENTIFIER*)>
 <!ATTLIST ENUMERATION
  NAME CDATA #REQUIRED
  EXPR (true | false | TRUE | FALSE | True | False) "true"
  ROOT (true | false | TRUE | FALSE | True | False) "false"
  DML_ELEMENT CDATA #IMPLIED
  DML_ATTRIBUTE CDATA #IMPLIED
  DML_VALUE CDATA #IMPLIED
  PEER (true | false | TRUE | FALSE | True | False) "true">
 <!ELEMENT COMMENT ANY>
 <!ELEMENT IDENTIFIER EMPTY>
 <!ATTLIST IDENTIFIER
  MAP CDATA #IMPLIED
  LITERAL CDATA #REQUIRED
  UNKNOWN (true | false | TRUE | FALSE | True | False) "false"
  TYPE (Interrogative | Adjective | Verb | Noun | Adverb | Pronoun | Preposition | Literal) #REQUIRED>
 <!-- An ATTRIBUTE can be an OBJECT, ENUMERATION, OR CALLBACK -->
 <!ELEMENT ATTRIBUTE EMPTY>
 <!ATTLIST ATTRIBUTE
  INFER (true | false | TRUE | FALSE | True | False) "false"
  MIN (0 | 1 | 2) "0"
  MAX (1 | 2 | many) "many"
  ID CDATA #REQUIRED
  DML_ELEMENT CDATA #IMPLIED
  DML_ATTRIBUTE CDATA #IMPLIED
  DML_VALUE CDATA #IMPLIED
  PEER (true | false | TRUE | FALSE | True | False) "true">
 <!ELEMENT CALLBACK EMPTY>
 <!ATTLIST CALLBACK
  NAME CDATA #REQUIRED
  EXPR (true | false | TRUE | FALSE | True | False) "true"
  ROOT (true | false | TRUE | FALSE | True | False) "false"
  CLASS CDATA #REQUIRED
  TOKENIZER CDATA #REQUIRED
  MAPPER CDATA #REQUIRED
  DML_ELEMENT CDATA #IMPLIED
  DML_ATTRIBUTE CDATA #IMPLIED
  DML_VALUE CDATA #IMPLIED
  PEER (true | false | TRUE | FALSE | True | False) "true">
 <!ELEMENT PATTERN (REGEXP+)>
 <!ATTLIST PATTERN
  NAME CDATA #REQUIRED
  EXPR (true | false | TRUE | FALSE | True | False) "true"
  ROOT (true | false | TRUE | FALSE | True | False) "false"
  DML_ELEMENT CDATA #IMPLIED
  DML_ATTRIBUTE CDATA #IMPLIED
  DML_VALUE CDATA #IMPLIED
  PEER (true | false | TRUE | FALSE | True | False) "true">
 <!ELEMENT REGEXP EMPTY>
 <!ATTLIST REGEXP
  STR CDATA #REQUIRED
  SEP CDATA #IMPLIED>
 <!ELEMENT DML_CALL (TRIGGER+)>
 <!ATTLIST DML_CALL
  NAME CDATA #REQUIRED>
 <!ELEMENT TRIGGER EMPTY>
 <!ATTLIST TRIGGER
  NAME CDATA #REQUIRED>
]>

The NML_MODEL element is the root of the NML file. This contains a set of IMPORTs and a set of OBJECTs. The DOMAIN argument to the NML_MODEL element is simply an indication to the content engine 110 of the name of the particular domain or application being processed by the content engine.

Some elements that can be used in NML are discussed below.

IMPORT

The required FILE argument contains the path of the file to import. A typical NML application contains a small set of custom objects and a much larger set imported from standard libraries. A classic example is the Date package, which recognizes common date phrasings: everything from “the last week of the second quarter before last” to “12/19/98”. In one embodiment, the IMPORT element directs a compiler to import a library from its FILE argument. For example, <IMPORT FILE="Utils/Date.nml"/> imports the date package. The IMPORT element may look like:

 <!ELEMENT IMPORT EMPTY>  <!ATTLIST IMPORT FILE CDATA #REQUIRED>

COMMENT

In an embodiment of the present invention, the COMMENT element is used to denote an NML comment (as opposed to a general XML comment), and may be attached to the model as a whole or to any single object. The COMMENT element may look like:

 <!ELEMENT COMMENT ANY>

OBJECT

The OBJECT element is the heart of NML. It may look like:

 <!ELEMENT OBJECT (COMMENT?,ATTRIBUTE*)>
 <!ATTLIST OBJECT
  NAME CDATA #REQUIRED
  EXPR (true | false | TRUE | FALSE | True | False) "true"
  SINGLETON (true | false | TRUE | FALSE | True | False) "false"
  ROOT (true | false | TRUE | FALSE | True | False) "false"
  DML_ELEMENT CDATA #IMPLIED
  DML_ATTRIBUTE CDATA #IMPLIED
  DML_VALUE CDATA #IMPLIED
  PEER (true | false | TRUE | FALSE | True | False) "true">

An OBJECT can be thought of as a type in a programming language. Unlike types in programming languages, however, an object in NML has no real implementation. Its purpose is to provide a target for the content engine's 110 mapping of a word, a phrase, or a sentence, and a source for the Domain back end's mapping to the application's API. As such, it merely needs to provide type information: this is the type to which the phrase or sentence is mapped. The substructure of the Object element gives the explicit instructions for mapping the phrase.

There are eight arguments to the Object element itself. The first argument, NAME, is required, and gives the name of the Object. All references to the Object, specifically those in ATTRIBUTE elements, are done by the NAME of the Object.

The second argument, EXPR, refers to the ability of this object to form expressions: phrases involving “and”, “or”, “;”, “/”, or “,”. “Monday or Tuesday”, for example, forms an expression over the Weekday object. Such expressions are always formed over homogeneous objects. Thus “Monday or December 23”, for example, would not form an expression over the Weekday object, though it would form an expression over a somewhat more abstract object.

The PEER and DML_ arguments (DML_ELEMENT, DML_ATTRIBUTE, and DML_VALUE) control DML generation, described below.

The SINGLETON argument indicates that any instance of this object can take only a single attribute. This is used when an object is, logically, an abstract superclass of several objects, only one of which can be represented. The MAX attribute declaration (see below) is not adequate to control this case, since the MAX attribute declaration controls the number of instances of a single attribute object, whereas SINGLETON controls the number of attribute objects.

The ROOT argument indicates whether an instance of this object can be at the root of an instance NML tree. An Object contains an optional comment (see above) and a set of ATTRIBUTES. If OBJECT is analogized to a type in a programming language, ATTRIBUTE is analogous to a member of the type. Reference is by name. The declaration:

<OBJECT NAME="HelloWorld">
  <ATTRIBUTE INFER="false" MIN="1" MAX="1" ID="Greeting"/>
</OBJECT>

indicates that the HelloWorld object has a member of type (object name) Greeting. Note that there is no distinction between attribute name, type name, and member name; all refer simply to the object name of the attribute.

 <!ELEMENT ATTRIBUTE EMPTY>
 <!ATTLIST ATTRIBUTE
  INFER (true | false | TRUE | FALSE | True | False) "false"
  MIN (0 | 1 | 2) "0"
  MAX (1 | 2 | many) "many"
  ID CDATA #REQUIRED>

As mentioned above, ATTRIBUTE declares a subobject or member of an object. Thus, ID="Greeting" says that this object contains a Greeting object as a subobject. First-time NML programmers often comment that there is no distinction between the member name and type, in contrast to most programming languages. To see this, consider the Java HelloWorld class:

public class HelloWorld {
  public Greeting greeting;
  public Everyone everybody;
}

In contrast, the NML equivalent

<OBJECT NAME="HelloWorld">
  <ATTRIBUTE INFER="false" MIN="1" MAX="1" ID="Greeting"/>
  <ATTRIBUTE INFER="true" MIN="1" MAX="1" ID="Everyone"/>
</OBJECT>

would correspond to:

public class HelloWorld {
  public Greeting;
  public Everyone;
}

To see why this is true, consider that the NML Object provides a target for mapping, and that member names distinct from types are only useful when there is more than one object of a specific type as a member. If this were the case in NML, the content engine 110 would be unable to know which object to map to which attribute. In one embodiment, this problem may be solved by permitting multiple attributes of a specific type, and letting the back end sort out their roles in the sentence.

ATTRIBUTE

The ATTRIBUTE element is empty, and has the following arguments:

ID: This argument refers to the object name of the attribute, and must be present. If the name is simple (a single word), it refers to an object in the current NML_MODEL. If it is qualified, it refers to an object from an imported model. Thus, for example, ID="Date.Date" refers to the Date object of the (imported) Date NML_MODEL. In one embodiment, objects referenced from imported files must use the qualified name, even if there are no conflicts. Thus, for example, even if there were no “Date” objects except in the “Date” NML_MODEL, attribute IDs in any file that imported “Utils/Date.nml” must reference the Date object as “Date.Date”. Qualifications of this form do not reference the directory structure at all: even if “Utils/Date.nml” appeared in the IMPORT declaration, “Date.Date”, not “Utils/Date.Date”, would be the attribute ID of the Date object. Finally, qualifications are always single-level: “Utils.Date.Date” is not a valid attribute ID.

INFER: This argument, when true, instructs the content engine 110 to immediately build the enclosing OBJECT whenever an object of the type named in ID is built. In the example:

<OBJECT NAME="HelloWorld">
  <ATTRIBUTE INFER="false" MIN="1" MAX="1" ID="Greeting"/>
  <ATTRIBUTE INFER="true" MIN="1" MAX="1" ID="Everyone"/>
</OBJECT>

whenever an Everyone object is built, a HelloWorld object containing it as an attribute is also built. By contrast, the creation of a Greeting object does not infer the creation of the HelloWorld object. The default value for INFER is false.

MIN: This argument indicates the minimum number of attributes of this ID that this object must have. In the example, a HelloWorld object must have at least one Greeting attribute and one Everyone attribute. The values of MIN can be 0, 1, or 2, with a default of 0. The set of possible values may be expanded if a need is ever found. Often the minimum cardinality of an object is known. For example, a book must have a title. This can be exploited in the mapping process by deleting objects which do not achieve the minimum cardinality for an attribute.

MAX: This argument indicates the maximum number of attributes of this ID that this object may have. In the example, a HelloWorld object may have at most one Greeting attribute and one Everyone attribute. The values of MAX can be 1, 2, or many, with a default of many. The set of possible values may be expanded if a need is ever found. Often the maximum cardinality of an object is known. For example, a book must have only one title. This can be exploited in the mapping process by deleting objects which exceed the maximum cardinality for an attribute.
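The ID qualification rules and the MIN/MAX pruning test described above are simple enough to state as code. The following Python sketch is illustrative only; the function names `resolve_attribute_id` and `within_cardinality` are hypothetical, and the behavior on malformed IDs (raising `ValueError`) is an assumption rather than something the text specifies.

```python
def resolve_attribute_id(attr_id, current_model):
    """Split an ATTRIBUTE ID into (model name, object name).

    A simple name refers to the current NML_MODEL; a qualified name has
    exactly one level ("Date.Date") and never includes a directory path.
    """
    if "/" in attr_id:
        raise ValueError(f"directory paths are not allowed in IDs: {attr_id!r}")
    parts = attr_id.split(".")
    if any(not part for part in parts):
        raise ValueError(f"empty name component in ID: {attr_id!r}")
    if len(parts) == 1:
        return (current_model, parts[0])
    if len(parts) == 2:
        return (parts[0], parts[1])
    raise ValueError(f"qualifications are single-level only: {attr_id!r}")

def within_cardinality(count, minimum="0", maximum="many"):
    """True if `count` attribute instances satisfy the MIN and MAX arguments.

    The defaults mirror the DTD defaults (MIN "0", MAX "many").
    """
    upper = float("inf") if maximum == "many" else int(maximum)
    return int(minimum) <= count <= upper

print(resolve_attribute_id("Greeting", "HelloWorld2"))   # ('HelloWorld2', 'Greeting')
print(resolve_attribute_id("Date.Date", "HelloWorld2"))  # ('Date', 'Date')
print(within_cardinality(2, minimum="1", maximum="1"))   # False
print(within_cardinality(5))                             # True
```

An object whose attribute counts fail `within_cardinality` would be a candidate for deletion during mapping, in the sense described for MIN and MAX.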

An extended example using NML is included in the attached appendix on the CD, which is hereby incorporated by reference herein.

3. DML

The NML document produced by the mapper 220 can, however, be too cumbersome for easy processing. In one embodiment, the mapping algorithm described in detail below creates a node in the NML instance object for each phrase successfully mapped. Some of these phrases have no semantic significance in the sentence. Moreover, many separate phrasings may be used to create the same logical object. Since the NML objects are closely tied to the phrasings used, multiple possible NML objects are used to denote the same logical object. Further semantic processing of the NML instance is required before the results can be used to populate a database or launch a search query.

Consider the NML models that recognizes an “ElectricalCurrent” object.There are many ways in English to specify a device's electrical current.One can refer to current or amperage; refer to the value as an Englishstring (“forty-five” or “one hundred and seventy five”) or as a number(45 or 175); attach the units implicitly (“amperage 65”) or explicitly(“current 65 amps”); or attached to the value (“65A”); and so on. Eachof these variations is captured in an NML model as a separate object;however, an application is dependent only upon the fact that current isspecified, the units specified, and the specified value. In the idealcase, this is captured as an XML element in a document:

<CURRENT UNIT=Amp VALUE=65/>

This element is an element of a Domain Markup Language designed for electrical devices. It is automatically extracted from any NML instance corresponding to a text fragment which describes the logical entity “65 amps”.

The Domain Markup Language corresponding to an NML model is specified in the NML model itself, with one specific NML Element and three attribute declarations. These are described here:

DML_CALL

<!ELEMENT DML_CALL (TRIGGER+)>
<!ATTLIST DML_CALL
    NAME CDATA #REQUIRED>
<!ELEMENT TRIGGER EMPTY>
<!ATTLIST TRIGGER
    NAME CDATA #REQUIRED>

This element directs the DML Generator 230 to begin a new DML instance with a root element whose name is the required attribute of DML_CALL, whenever an NML Element whose name corresponds to a TRIGGER is detected in the NML Instance. For example,

<DML_CALL NAME=“CURRENT”>
  <TRIGGER NAME=“SimpleAmperageObject”/>
  <TRIGGER NAME=“SimpleCurrentObject”/>
</DML_CALL>

directs the DML Generator 230 to begin a new DML Instance with root element CURRENT whenever an instance of either a SimpleAmperageObject or a SimpleCurrentObject is detected in the NML Instance.
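By way of illustration only, the trigger lookup described above can be sketched in Python. The function names build_trigger_table and dml_roots are hypothetical and do not appear in the appendix, and the declaration is written with plain ASCII quotes so that a standard XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# The DML_CALL declaration from the text, with plain ASCII quotes.
DML_CALL_XML = """
<DML_CALL NAME="CURRENT">
  <TRIGGER NAME="SimpleAmperageObject"/>
  <TRIGGER NAME="SimpleCurrentObject"/>
</DML_CALL>
"""


def build_trigger_table(dml_call_xml):
    """Map each TRIGGER name to the DML root element name it starts."""
    call = ET.fromstring(dml_call_xml)
    root_name = call.get("NAME")
    return {t.get("NAME"): root_name for t in call.findall("TRIGGER")}


def dml_roots(nml_element_names, triggers):
    """Return one DML root name for each NML element that matches a trigger."""
    return [triggers[name] for name in nml_element_names if name in triggers]


triggers = build_trigger_table(DML_CALL_XML)
roots = dml_roots(["Voltage", "SimpleCurrentObject"], triggers)
print(roots)
```

In this sketch only the SimpleCurrentObject element starts a new DML instance; the Voltage element is ignored because no TRIGGER names it.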

The following three attributes attach to any NML OBJECT, ENUMERATION, CALLBACK, PATTERN, or ATTRIBUTE, and control the creation of DML Elements and Attributes, and (optionally) setting the values of DML Attributes. They are described below.

DML_ELEMENT

This attribute optionally appears with a name (e.g., DML_ELEMENT=“Current”). If absent, the name is assumed to be the NAME of the NML OBJECT, ENUMERATION, PATTERN, or CALLBACK, or the ID of the NML ATTRIBUTE. It directs the creation of a DML Element of type name, whenever the corresponding NML structure is encountered in the NML instance. This differs from DML_CALL in that the DML Element is not created as the root of a new DML structure; rather, the new element is embedded as a subobject of any containing DML Element. This will be explained in more detail, below, when the DML generation algorithm is explicated.

Examples:

<OBJECT NAME=“Current” DML_ELEMENT=“CURRENT”>

This declaration directs the creation of a DML Element named CURRENT whenever an NML Object named Current is encountered in the NML Instance tree. Exactly the same declarations would apply for ENUMERATION, CALLBACK, or PATTERN, with exactly the same effect.

<OBJECT NAME=“Current” DML_ELEMENT=“CURRENT”>
  <ATTRIBUTE ID=“AmpDeclaration” DML_ELEMENT=“Amperage”.../>

This declaration directs the creation of a DML Element named CURRENT whenever an NML Object named Current is encountered in the NML Instance tree. In addition, if the Current object had an AmpDeclaration subobject, then an Amperage DML_ELEMENT would be created as a sub-element of CURRENT, as can be seen in the following:

NML Instance                            DML Instance
<OBJECT NAME=“Current”...>              <CURRENT...
  <OBJECT NAME=“AmpDeclaration”>          <Amperage ...>
    ...                                     ...
  </OBJECT>                               </Amperage>
</OBJECT>                               </CURRENT>

DML_ATTRIBUTE

This attribute optionally appears with a name (e.g., DML_ATTRIBUTE=“Current”). If absent, the name is assumed to be the NAME of the NML OBJECT, ENUMERATION, PATTERN, or CALLBACK, or the ID of the NML ATTRIBUTE. It directs the creation of a DML Attribute of type name, whenever the corresponding NML structure is encountered in the NML instance. The new attribute is attached as an attribute of the nearest containing DML Element, generated either from a DML_CALL or DML_ELEMENT declaration. This will be explained in more detail, below, when the DML generation algorithm is explicated.

Examples:

<ENUMERATION NAME=“VoltWord” DML_ATTRIBUTE=“VoltUnit” >
  <IDENTIFIER TYPE=“Noun” LITERAL=“gigavolt” UNKNOWN=“false” />
  <IDENTIFIER TYPE=“Noun” LITERAL=“kilovolt” UNKNOWN=“false” />
  <IDENTIFIER TYPE=“Noun” LITERAL=“megavolt” UNKNOWN=“false” />
  <IDENTIFIER TYPE=“Noun” LITERAL=“millivolt” UNKNOWN=“false” />
  <IDENTIFIER TYPE=“Noun” LITERAL=“volt” UNKNOWN=“false” />
</ENUMERATION>

The above code directs the creation of a DML Attribute named VoltUnit whenever an NML Object named VoltWord is encountered in the NML Instance tree. The value of the attribute, unless directly specified by a DML_VALUE declaration (see below), is taken to be the literal which generated the VoltWord object, and thus:

<ENUMERATION NAME=“VoltWord”>
  <IDENTIFIER LITERAL=“gigavolt”/>
</ENUMERATION>

generates the DML Attribute and value VoltUnit=“gigavolt”. This is attached to the containing DML_ELEMENT, e.g.

<OBJECT NAME=“Voltage” DML_ELEMENT=“Voltage” >
  <ATTRIBUTE INFER=“true” MIN=“0” MAX=“1” ID=“VoltWord” />
  ...
</OBJECT>

coupled with the VoltWord declaration above, gives the following NML Instance and DML instance for the word “gigavolt”:

NML Instance                            DML Instance
<OBJECT NAME=“Voltage”...>              <Voltage
  <ENUMERATION NAME=“VoltWord”>             VoltUnit=“gigavolt”...>
    <IDENTIFIER LITERAL=“gigavolt”/>      ...
  </ENUMERATION>                        </Voltage>
</OBJECT>

DML_VALUE

DML_VALUE is an optional adjunct to DML_ATTRIBUTE, and permits an NML programmer to override the default value assigned to an attribute by the DML Generation procedure. This is most often used when synonyms or multiple phrasings can appear, and a normalized value is desired.
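The interplay of the three declarations described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the DML generation algorithm of the appendix: the DECLS table, the tuple layout of an NML node, the generate function, and the KvWord entry (with its “kV” literal) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration table: NML name -> DML directives.
DECLS = {
    "Voltage": {"element": "Voltage"},                  # DML_ELEMENT
    "VoltWord": {"attribute": "VoltUnit"},              # DML_ATTRIBUTE, default value
    "KvWord": {"attribute": "VoltUnit", "value": "kilovolt"},  # with DML_VALUE
}


def generate(node, parent=None):
    """Walk an NML node given as (name, literal, children) and emit DML.

    An element declaration opens a new DML element under the nearest
    containing one; an attribute declaration sets an attribute on the
    nearest containing DML element, using DML_VALUE when present and the
    generating literal otherwise.
    """
    name, literal, children = node
    decl = DECLS.get(name, {})
    current = parent
    if "element" in decl:
        if parent is None:
            current = ET.Element(decl["element"])
        else:
            current = ET.SubElement(parent, decl["element"])
    elif "attribute" in decl and parent is not None:
        parent.set(decl["attribute"], decl.get("value", literal))
    for child in children:
        generate(child, current)
    return current


default_dml = generate(("Voltage", None, [("VoltWord", "gigavolt", [])]))
normalized_dml = generate(("Voltage", None, [("KvWord", "kV", [])]))
print(ET.tostring(default_dml, encoding="unicode"))
print(ET.tostring(normalized_dml, encoding="unicode"))
```

In the first call the attribute value is the literal “gigavolt”; in the second, the assumed DML_VALUE replaces the literal “kV” with the normalized value “kilovolt”.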

B. Functionality of the Content Engine

FIG. 4 is a flowchart illustrating the functionality of the content engine 110 in accordance with an embodiment of the present invention. As can be seen from FIG. 4, the content engine 110 receives the input 410 and tokenizes it. The parser 210 then creates 420 all the parse trees based on the tokenized input and the grammar from the grammar storage 170. Next, for each parse tree, the mapper 220 generates 430 an instance tree based on the application domain specific NML provided by the NML Model Module 140. The mapper 220 then also prunes 440 the instance trees, and then chooses 450 the best map. Finally, the DML generator 230 uses this best map to generate 460 the appropriate DML. These steps are discussed in detail below.

The functionality of the content engine 110 outlined in FIG. 4 can be used both for content synthesis and for retrieving data. For content synthesis, the input received 410 may, for instance, be a catalog of items (and their descriptions) offered by an e-commerce site. For retrieving data, the input received 410 may, for instance, be a search query by a user. In the case of content synthesis, the DML generated 460 may be used to populate a database, while in the case of data retrieval, the DML generated 460 may be used to search a database that has been previously populated.

The input is tokenized 410 by the content engine 110. In one embodiment of the present invention, tokens are simply the words in the input text. However, multiple words may sometimes be treated as a single token, for example, the two or more words that form a name such as San Francisco, or New York City. Multiple words that form a compound noun or other concepts such as dates, times, number patterns etc., may also be aggregated into a single token.
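One possible way to aggregate multi-word names into single tokens is a greedy longest-match scan, sketched below in Python. The MULTIWORD list and the tokenize function are assumptions made for illustration; the text does not specify how the content engine 110 recognizes such names.

```python
# Hypothetical list of multi-word names to be kept together as one token.
MULTIWORD = {("san", "francisco"), ("new", "york"), ("new", "york", "city")}
MAX_LEN = max(len(name) for name in MULTIWORD)


def tokenize(text):
    """Split text into words, merging known multi-word names into one token."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first so "New York City" beats "New York".
        for size in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + size])
            if candidate in MULTIWORD:
                tokens.append(" ".join(words[i:i + size]))
                i += size
                break
        else:
            # No multi-word name starts here: emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize("What is the temperature in San Francisco?"))
```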

1. Parsing

Once the input is tokenized 410, the parser 210 generates parse trees from the tokenized input based on the grammar obtained from the grammar storage 170. In one embodiment, the parser 210 creates all possible parse trees.

The parser 210 creates parse trees, similar in form to the parse tree (conceptually) created by a compiler from a program. The leaves of this tree are the tokens (or words of the input text); the internal nodes represent phrases and subunits of the sentence, where each node represents the subunit containing all the tokens descended from that node. The root node represents the sentence itself.

To see in detail how this is done, consider the ambiguous sentence “The boy helped the girl with the suitcase.” This sentence leads to two parse trees, which are distinguished by the placement of the prepositional phrase “with the suitcase.” In the first tree, the phrase “with the suitcase” modifies the verb “help.” In the second tree, the phrase modifies the noun “girl.” FIG. 5A depicts the first tree, while FIG. 5B depicts the second tree. In these descriptions, the boxes mark the recognized grammar symbols such as “SVO” (for Subject-Verb-Object), “NP” (Noun Phrase), and so on. The generating tokens are beneath the lowest-level boxes in the figure.

Consideration of FIGS. 5A and 5B reveals that the nodes of the trees are the same, and are distinguished only by the edge into the node representing the phrase “with the suitcase.” In the first case, the edge 510 runs from the node representing the verb phrase “helped”; in the second case, the edge 520 runs from the node representing the phrase “the girl.” This aspect leads one to the conclusion that both parse trees can be represented in a single parse Directed Acyclic Graph (“DAG”). The DAG is depicted in FIG. 5C. As can be seen from FIG. 5C, the DAG itself contains exactly the same number of nodes as each of the two component parse trees, and only one more edge than either of the two component parse trees.

The parser 210 can employ any parsing algorithm. In one embodiment, the parsing algorithm of Cocke-Younger-Kasami may be used. Details of the Cocke-Younger-Kasami algorithm can be found in Introduction to Formal Language Theory, Harrison, M. A., Addison-Wesley, 1978. A sample of the Cocke-Younger-Kasami algorithm is shown below in Tables 12 A-E. While the algorithm shown below provides a single parse of a sentence, it may be modified to generate all parses of the sentence.

The core of this algorithm is an (n+1)×(n+1) table, where “n” is the number of tokens in the parse. The tokens are here denoted a₀ . . . a_(n-1), and the table elements T_(0,0), . . . , T_(n,n). The upper half of the table is filled starting with the entries T_(i,i+1) and ending with T_(0,n), in the order given below. The items just above the diagonal are filled with the grammar nonterminals that directly derive the relevant token. The items in the remainder of the table are filled in as follows:

T_(i,j) = {A | A=>BC, B ∈ T_(i,k), C ∈ T_(k,j), i+1 ≤ k ≤ j−1}.

The result of these equations is that, at the completion of the algorithm, T_(i,j) contains precisely the set of nonterminals which derive the phrase beginning at a_(i) and terminating in a_(j−1). T_(0,n) then contains the set of nonterminals which derive the entire sentence.

for (i = 0; i < n; i++) {
  t[i][i+1] = {A | A=>a_(i)}
}
for (d = 2; d <= n; d++) {
  for (i = 0; i <= n − d; i++) {
    j = d + i;
    for (k = i+1; k < j; k++) {
      t[i][j] = t[i][j] ∪ {A | A=>BC, B ∈ t[i][k], C ∈ t[k][j]};
    }
  }
}

It can be seen from the above pseudocode that the time taken by this parsing algorithm is proportional to PN³, where N is the number of words in the sentence and P is the number of distinct parses. The algorithm is shown running on the string aabb, given the grammar Grammar3:

S=>AB
S=>PB
P=>AS
A=>a
B=>b

The initial matrix is shown below.

T_(0,0) T_(0,1) T_(0,2) T_(0,3) T_(0,4) A A B B

After the first iteration of the loop with loop variable d, the matrix is:

T_(0,0) T_(0,1) T_(0,2) T_(0,3) T_(0,4) A S, P A S S, P B B

After the final iteration, the matrix is:

T_(0,0) T_(0,1) T_(0,2) T_(0,3) T_(0,4) A S, P S, P A S S, P B B

The root of the parse tree is contained in the element T[0][4], in other words, in the cell in the top-right corner of the matrix. At this point the parsing algorithm terminates and the correct parses are read from the top-right corner of the matrix.
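The table-filling procedure described above can be sketched in runnable form. The following Python function is an illustrative transcription of the pseudocode, run on the string aabb with the five productions listed; the function name cyk and the representation of the grammar as two lists (UNARY and BINARY) are assumptions of this sketch.

```python
def cyk(tokens, unary, binary):
    """Fill the (n+1) x (n+1) table t, where t[i][j] holds the
    nonterminals deriving tokens i .. j-1."""
    n = len(tokens)
    t = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Entries just above the diagonal: nonterminals deriving a single token.
    for i, tok in enumerate(tokens):
        t[i][i + 1] = {a for a, terminal in unary if terminal == tok}
    # Longer spans, in order of increasing width d.
    for d in range(2, n + 1):
        for i in range(0, n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                t[i][j] |= {a for a, (b, c) in binary
                            if b in t[i][k] and c in t[k][j]}
    return t


# The grammar given in the text: S=>AB, S=>PB, P=>AS, A=>a, B=>b.
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("P", "B")), ("P", ("A", "S"))]

table = cyk(list("aabb"), UNARY, BINARY)
accepted = "S" in table[0][4]
print(accepted)
```

The sentence is recognized when the start symbol S appears in the top-right cell, table[0][4]. This version only records which nonterminals derive each span; recovering the parse trees themselves would additionally require storing, for each entry, the split point k and the production used.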

2. Mapping

As discussed above, the mapper 220 generates 430 instance trees for each parse tree based on the application-specific NML provided by the NML module 140. In one embodiment, the mapper 220 then prunes 440 these instance trees to discard invalid and/or incomplete trees. The mapper then chooses 450 the best map. Each of these steps is discussed in detail below.

An object in the instance tree is said to cover a node of the parse tree (equivalently, a node is said to “map” to an object), if the mapper 220 matches the object to the node, by the rules explained below. The goal of the mapping algorithm is to map a single object to the root node of the tree. In one embodiment, if a single NML instance cannot be obtained for a sentence, the system switches to another mapping mechanism that tries to obtain the best set of disjoint NML instances that cover the entire sentence. There are several different methods to perform a partial map of a sentence.

a) Generation of Instance Trees

In one embodiment, instance trees are generated by starting out at the leaf (or terminal) nodes of a parse tree. In brief, a terminal node is created for each token. At each terminal node of a parse tree, all enumerated objects are indexed by the terminal word. An inference process is then executed to create inferred objects. The algorithm then moves up the parse tree, generating a new object at a parent node by composing the objects of the child nodes at the node. At each node there is one child node that is predetermined to be the main child of the node. The main child corresponds to the grammatical object that plays the central role in the grammatical structure represented by the node. For a noun phrase, this is the head noun; for a prepositional phrase, this is the prepositional complement; and so on.

Objects can be generated in several ways. Specifically, objects can be generated by enumeration from identifiers, enumeration from callbacks, and enumeration from patterns. In addition, objects can also be inferred from other objects. Let us consider each of these in turn.

Enumeration from Identifiers:

An Enumeration is an object created by the presence of a single word or phrase.

<!ELEMENT ENUMERATION (COMMENT?, IDENTIFIER*)>
<!ATTLIST ENUMERATION
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) “true”>

In the example shown below, the enumeration “Greeting” is created when the word “hello” is encountered, because of the code snippet:

<ENUMERATION NAME=“Greeting”>
  <IDENTIFIER LITERAL=“hello”>
</ENUMERATION>

It is important to note that an Enumeration is in every way identical to an object, except for the fact that an object is always inferred from an existing attribute and an Enumeration is inferred from a word or phrase.

The IDENTIFIER element recognizes a single word that forces creation of the object. The specific word is given in the LITERAL argument.

<!ELEMENT IDENTIFIER EMPTY>
<!ATTLIST IDENTIFIER
    MAP CDATA #IMPLIED
    LITERAL CDATA #REQUIRED
    UNKNOWN (true | false | TRUE | FALSE | True | False) “false”
    TYPE (Any | Adjective | Verb | Noun | Adverb | Pronoun | Preposition) “Any”>

The IDENTIFIER element has no substructure, and can take the following arguments, listed below:

LITERAL: This argument gives the literal string that maps to the object. In general, only the root of a specific verb or noun should appear in the literal argument; the Content Engine will recognize and map tenses, declensions, and all derivative forms of verbs and nouns. For example, <IDENTIFIER LITERAL=“have”> will map “has”, “had”, “having”, “has had”, and so on, and <IDENTIFIER LITERAL=“woman”> will map “women”, “women's”, “womanly”, and so on. LITERAL is the only required argument of IDENTIFIER, and will often be the only argument.

MAP: Occasionally, synonyms are used to indicate a single object, and the semantic processing of the object is independent of which synonym is used. A good example is “stock” and “security”. In this case, the back-end code can be simplified if the synonyms are reduced to a single canonical case. MAP does this. If MAP appears, then the recognized literal will be mapped to the string that is given as the argument to MAP. The default value for MAP is the value of the LITERAL argument.

TYPE: This restricts the mapping to the particular part of speech given as the argument. Often, words can assume several different parts of speech. For example, the word “green” is a noun (denoting a patch of grassy land or a color), an adjective, or a verb. It is often desired to restrict an IDENTIFIER to only one of these roles. If Verb is given as the value of TYPE, then only verbs will map to this particular identifier. The default value, Any, maps any part of speech to this IDENTIFIER.
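The combined effect of the LITERAL, MAP, and TYPE arguments can be sketched in Python as follows. The Identifier class and its match method are hypothetical, and the sketch assumes that the token has already been reduced to its root form and tagged with a part of speech by an earlier step that is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Identifier:
    literal: str                 # LITERAL (required)
    map: Optional[str] = None    # MAP; defaults to the value of LITERAL
    type: str = "Any"            # TYPE; "Any" accepts every part of speech

    def match(self, root_form, part_of_speech):
        """Return the canonical string if the token maps here, else None."""
        if root_form != self.literal:
            return None
        if self.type != "Any" and self.type != part_of_speech:
            return None
        return self.map if self.map is not None else self.literal


security = Identifier("security", map="stock")
green_verb = Identifier("green", type="Verb")

canonical = security.match("security", "Noun")
rejected = green_verb.match("green", "Noun")
print(canonical, rejected)
```

Here the synonym “security” is reduced to the canonical string “stock”, while the noun reading of “green” is rejected by an identifier restricted to verbs.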

Enumeration from Callbacks:

Another way in which objects can be created is from Callbacks. The CALLBACK element functions in a fashion similar to ENUMERATION: it is a means for mapping individual tokens in a sentence to OBJECTS. It is designed for the specific case where the set of IDENTIFIERs for a particular OBJECT is very large, changes dynamically, or both.

<!ELEMENT CALLBACK EMPTY>
<!ATTLIST CALLBACK
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) “true”
    CLASS CDATA #REQUIRED
    PARSER CDATA #REQUIRED
    MAPPER CDATA #REQUIRED>

A good example of such a situation is the set of stock symbols, which number in the thousands and which change daily due to IPOs, mergers, and name and symbol changes. For such sets, the use of IDENTIFIERs is unwieldy: the NML file would be very large and in a state of constant update. A better solution is to use a standard relational database, and call it to recognize a stock symbol. The particular example for stock symbols is:

<CALLBACK NAME=“CompanyFundIndexDbName”
    EXPR=“False”
    CLASS=“ecCallback.CompanyFundIndexNameDatabase”
    PARSER=“isCompanyFundIndexName”
    MAPPER=“findCompanyFundIndexSymbol”>
  <COMMENT>
    Each company, fund, and index name or symbol is obtained via a callback to a method that matches the names in a database.
  </COMMENT>
</CALLBACK>

Formally, the CALLBACK element defines a Java class which contains at least two methods: a method which takes a string and returns a boolean (this is named in the PARSER argument), and a method which takes a string and returns another string (this is named in the MAPPER argument). While this was specifically designed with a SQL interface in mind, there is no restriction in the code for this: any Java class having the appropriate methods will do.

In one embodiment, the CALLBACK element may have no structure, and have the following arguments, all of which are required:

CLASS: This is the fully-qualified name of the Java class containing the two methods referenced above. The Content Engine will call the method <CLASS>.<PARSER>(token); to recognize the token, and <CLASS>.<MAPPER>(token); to map it (in the example above, “ecCallback.CompanyFundIndexNameDatabase.isCompanyFundIndexName(token);” for recognition, and “ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(token);” for mapping). Thus, the CLASS must be accessible to the Content Engine from the string as given here using the standard Java class loader methods.

PARSER: This is the name of the method within CLASS called to do the recognition: it should take a single String argument and return a boolean. This functions exactly as the LITERAL argument to IDENTIFIER; Content Engine will pass the root form of the token, not the token itself, to the parser. Thus, the word “Microsoft's”, appearing in a sentence, yields the call “ecCallback.CompanyFundIndexNameDatabase.isCompanyFundIndexName(microsoft)”. When this returns true, the behavior of the compiler is exactly identical to that produced when “microsoft” had appeared in a list of IDENTIFIERs for this OBJECT.

MAPPER: This is the name of the method within CLASS called to map recognized tokens to a canonical form: it should take a String and return a String. This functions exactly as the MAP argument to IDENTIFIER. As with PARSER, Content Engine will pass the root form of the token, not the token itself, to the mapper. To obtain the default behavior of IDENTIFIER, MAPPER should simply return its argument. A richer example is the one cited: ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol returns the symbol associated with the name. So, for example, ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(microsoft) returns “msft”, as does ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(msft).
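The calling convention described above, in which the methods named by PARSER and MAPPER are looked up on the class named by CLASS, can be illustrated with a Python analogue. The text describes a Java class, so the following is only a sketch of the protocol: the in-memory dictionary stands in for the database, and the run_callback helper is hypothetical.

```python
class CompanyFundIndexNameDatabase:
    """Stand-in for the database-backed class; a dict replaces the SQL lookup."""

    _SYMBOLS = {"microsoft": "msft", "msft": "msft"}

    @classmethod
    def isCompanyFundIndexName(cls, token):
        # PARSER: takes a string and returns a boolean.
        return token in cls._SYMBOLS

    @classmethod
    def findCompanyFundIndexSymbol(cls, token):
        # MAPPER: takes a string and returns its canonical form.
        return cls._SYMBOLS[token]


def run_callback(cls, parser, mapper, root_form):
    """Call <CLASS>.<PARSER>(token), then <CLASS>.<MAPPER>(token) if recognized."""
    if getattr(cls, parser)(root_form):
        return getattr(cls, mapper)(root_form)
    return None


symbol = run_callback(CompanyFundIndexNameDatabase,
                      "isCompanyFundIndexName",
                      "findCompanyFundIndexSymbol",
                      "microsoft")
print(symbol)
```

As in the text, the root form “microsoft” rather than the surface token “Microsoft's” is passed to both methods, and an unrecognized token simply produces no object.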

In an alternate embodiment, CALLBACK 520 may be simplified if the Content Engine 110 adopts an interface-based protocol for its callbacks. In this case, the PARSER and MAPPER arguments to CALLBACK will disappear, and the CALLBACK CLASS will be required to implement the Content Engine 110 callback protocol.

Enumeration from Patterns:

A pattern is the third logical equivalent to an enumeration. This is used when a large number of identifiers can be specified by a regular expression. A full description of regular expressions (formally, regular languages) can be found in Aho, Hopcroft, and Ullman, Introduction to Automata and Language Theory, Addison-Wesley, 1979. A simple example is a Social Security Number, which is represented by the regular expression:

[1-9][0-9][0-9]-?[0-9][0-9]-?[0-9][0-9][0-9][0-9]

which indicates that a social security number is any string which begins with a digit between 1 and 9, followed by two digits between 0 and 9, an optional dash, two digits between 0 and 9, an optional dash, and then four digits between 0 and 9.
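For illustration, the regular expression above can be exercised directly. The following Python sketch uses the standard re module; the helper name is_ssn is hypothetical, and requiring the whole token to match is an assumption of this sketch.

```python
import re

# The pattern from the text, unchanged.
SSN = re.compile(r"[1-9][0-9][0-9]-?[0-9][0-9]-?[0-9][0-9][0-9][0-9]")


def is_ssn(token):
    """True if the whole token matches the pattern."""
    return SSN.fullmatch(token) is not None


for candidate in ["123-45-6789", "123456789", "023-45-6789", "123-45-678"]:
    print(candidate, is_ssn(candidate))
```

The first two candidates match, with and without the optional dashes; the third fails because the leading digit must be between 1 and 9, and the fourth because the final group has only three digits.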

In one embodiment, the content engine 110 accepts any regular expressions specified by the PERL 5 compiler (see http://www.perldoc.com/perl5.6/pod/perlre.html for the current specification). The regular expressions are captured in the STR argument of the contained REGEXP element. Occasionally, it is useful to specify multiple regular expressions in the same pattern, which are separated by an optional SEP character (space by default).

<!ELEMENT PATTERN (REGEXP+)>
<!ATTLIST PATTERN
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) “true”
    ROOT (true | false | TRUE | FALSE | True | False) “false”
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) “true”>
<!ELEMENT REGEXP EMPTY>
<!ATTLIST REGEXP
    STR CDATA #REQUIRED
    SEP CDATA #IMPLIED>

Inference:

Apart from the enumeration techniques discussed above, one more way in which an instance object can be created is by inference. Inference is when the presence of a modifier can imply the existence of an object, even when the object is not explicitly identified. This can occur through ellipsis, or, more commonly, because the underlying object is abstract and is not always (and perhaps never) explicitly identified.

Consider, for example, the generic object “Weather,” which has attributes “Temperature,” “Precipitation,” “Outlook,” and “Location.” Though such an object may be explicitly identified (as, for example, by the keyword “weather”) it will often not be, as in the question “What is the temperature in San Francisco?” In this case, the request for the “Weather” object is inferred from the request for its attribute “Temperature.”

Not all attributes infer the presence of a modified object. In the example above, the city San Francisco is a “Location” for “Weather,” but does not infer a “Weather” object. “Temperature,” however, does. A developer declares that a particular attribute infers the existence of the object. In the map, inferred objects are created immediately along with the inferring attribute, and are marked with an “inferred” tag.

In one embodiment of the present invention, inference is related to type inference in an object-oriented language in a deep and non-obvious fashion. In general, if a type A is a subclass of a type B in an object-oriented language, then every instance of A bears within it an instance of type B. Put better, one can think of A as B with additional properties. Thus, creation of an instance of A forces the creation of an instance of B. In some sense, then, the declaration of a sub-type in a program is a declaration of an inferencing attribute.

In an alternate embodiment, rather than encapsulating the inferencing attribute in a sub-type declaration, the inferencing attribute may directly infer the object. In this embodiment, the attribute can be directly recognized, and the inferred object can be built directly from it.

As discussed above, the INFER element is an argument of an attribute, which, when true, instructs the content engine 110 to immediately build the OBJECT whenever an object of the type named in ID is built. In the example:

<OBJECT NAME=“HelloWorld”>
  <ATTRIBUTE INFER=“false” MIN=“1” MAX=“1” ID=“Greeting”>
  <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Everyone”>
</OBJECT>

whenever an Everyone object is built, a HelloWorld object containing it as an attribute is also built. The default value for INFER is false.
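The effect of the INFER argument can be sketched as a closure computation: starting from the objects already built, every OBJECT that declares one of them as an inferring attribute is added, and the process repeats until nothing new appears. The following Python sketch is illustrative only; the MODEL table and both function names are hypothetical.

```python
# Hypothetical model table: OBJECT name -> {attribute ID: INFER flag}.
MODEL = {"HelloWorld": {"Greeting": False, "Everyone": True}}


def inferred_objects(built_name, model=MODEL):
    """Names of the OBJECTs to build when an object named built_name is built."""
    return [obj for obj, attrs in model.items() if attrs.get(built_name, False)]


def close_under_inference(names, model=MODEL):
    """Add inferred objects repeatedly until no new object appears."""
    result = list(names)
    for name in result:  # the list may grow while it is being scanned
        for inferred in inferred_objects(name, model):
            if inferred not in result:
                result.append(inferred)
    return result


from_everyone = close_under_inference(["Everyone"])
from_greeting = close_under_inference(["Greeting"])
print(from_everyone, from_greeting)
```

With the declaration above, building an Everyone object also yields a HelloWorld object, while building a Greeting object yields nothing further.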

As the objects are created, the “handle” of the instance tree must be adjusted. It may be helpful to define some terminology here. When an English phrase or sentence is parsed, there is always a dominant element. In the case of a subject-verb-object sentence, for example, the dominant element is the verb phrase; in the case of a noun phrase, it is the head noun; in the case of an adjectival phrase, it is the adjective. This element is referred to as the head word or head phrase of the phrase.

As the mapper 220 progresses, it creates trees of objects centered on nodes of the parse tree. Such a tree of objects, centered on a node of the parse tree, is said to be a map of the node. The link between a tree of objects and the parse tree is a single object within the map, called the handle of the map. The handle of the map may be thought of as the root of the map of the head phrase of the mapped node in the parse tree. Its role (and how the handle moves during the mapping process) will be explained below.

There is a fundamental equivalence between the object attribute tree in a program and the modifier hierarchy in a parse tree of a sentence. In the parse of a sentence, various words are the anchors of their phrase. For example, in any noun phrase, the noun is the anchor. The other sub-phrases are the modifiers. The anchor of the phrase defines the object in the component tree; the modifiers are attributes of the object. If an object Girl had been declared with identifier “girl” and attribute Carrying with identifier “with”, then the sentence “the boy helped the girl with the suitcase” would have its Object mapped to a component Girl with attribute Carrying. However, if Girl did not have an attribute Carrying then the object would have been mapped to a component Girl.

The easiest way to see how an object grows by accumulating attributes is to imagine two objects of the same type as composing into a single object by merging their attributes. Consider the following snippet from the HelloWorld programs:

<OBJECT NAME=“HelloWorld”>
  <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Greeting”>
  <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Everyone”>
</OBJECT>

In this case, both the Greeting object and the Everyone object create a HelloWorld object through the inference mechanism. Both of these HelloWorld objects have a missing, required attribute: once merged into a single object, the required attributes for both are complete.

Two objects that are unrelated in the sentence, for example, should not compose: they refer to different semantic entities within the sentence, unless there is some overlying grammatical link between them. Consider the sentence “hello, dolly and thanks, everyone.” The HelloWorld objects created by the Greeting object containing “hello” and the Everyone object containing “everyone” should not merge: this would imply that there was a single phrase containing both Greeting and Everyone, and this is false. A second method that might be imagined would have an object adding as attributes only the maps of the modifiers of its head phrase. However, in English the semantic analysis of a sentence often contains inversions of its grammatical structure. For example, in the sentence “Show me the price of Microsoft,” the main semantic object is “the price of Microsoft,” and the verb phrase “Show” is, semantically, a modifier. Nonetheless, in the parse the head phrase is “Show,” which is modified by “the price of Microsoft.”

The rule used by the Content Engine 110 is very simple. A map may add as an attribute:
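The merging of two inferred objects of the same type can be pictured with a short Python sketch. The dictionary layout and the merge function are hypothetical, and the sketch assumes MAX=“1” for every attribute, so that two objects carrying the same attribute do not merge. It does not model the grammatical conditions under which merging is permitted.

```python
def merge(a, b, required):
    """Merge two objects of the same type by combining their attributes.

    Returns None if the types differ or if both objects already carry the
    same attribute (assumed here to violate MAX="1").
    """
    if a["type"] != b["type"] or set(a["attrs"]) & set(b["attrs"]):
        return None
    merged = {"type": a["type"], "attrs": {**a["attrs"], **b["attrs"]}}
    merged["complete"] = all(name in merged["attrs"] for name in required)
    return merged


from_greeting = {"type": "HelloWorld", "attrs": {"Greeting": "hello"}}
from_everyone = {"type": "HelloWorld", "attrs": {"Everyone": "everyone"}}

hello_world = merge(from_greeting, from_everyone, ["Greeting", "Everyone"])
print(hello_world)
```

Each inferred HelloWorld object is missing one required attribute on its own; the merged object has both.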

(1) The map of a modifier of its handle; or

(2) The map of a phrase modified by its handle.

In case (1), the handle remains unchanged. In case (2), the handle moves to the attribute, so that the handle remains at the map of the head phrase of the parse. Thus, in our example, assume that a Stock object had been created for the phrase “the price of Microsoft”. The handle of this map is the Stock object. The phrase “the price of Microsoft” modifies the verb “show”, and so under rule (2) the Stock object can add a Show attribute. When it does, the handle of the map moves to the Show attribute of the Stock object. In other words, the root of the map is no longer the handle.
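The two cases of the rule, and the resulting movement of the handle, can be sketched as follows. The Map class and its method names are hypothetical and serve only to illustrate the Stock and Show example; objects are represented by their names.

```python
class Map:
    """A tree of objects with a root and a handle (illustrative only)."""

    def __init__(self, root):
        self.root = root        # name of the root object of the map
        self.handle = root      # the handle starts at the root
        self.attributes = []

    def add_modifier(self, other):
        """Case (1): other maps a modifier of our handle; the handle stays."""
        self.attributes.append(other.root)

    def add_modified(self, other):
        """Case (2): other maps the phrase our handle modifies; the handle
        moves to the newly added attribute."""
        self.attributes.append(other.root)
        self.handle = other.root


stock = Map("Stock")             # map of "the price of Microsoft"
stock.add_modified(Map("Show"))  # "the price of Microsoft" modifies "show"
print(stock.root, stock.handle, stock.attributes)
```

After the Show attribute is added under case (2), the root of the map is still the Stock object, but the handle is the Show attribute.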

Occasionally, it is helpful to force the handle to move to the root of the map. This happens when the programmer can guarantee that no further attributes can be added to this map from the modifiers of the head phrase. A good example occurs in the case considered in the previous section, where it is clear that no further modifiers of the verb “show” will become attributes of the root Stock object. In order to permit this, inference moves the handle of the map to the root of the map. An inferred object's handle is always the root of the map.

Details of the Mapping Algorithm

Further details regarding the generation 430 of instance trees are outlined in the flowchart depicted in FIG. 6. Based on the application-specific NML obtained from the NML module 140, the mapper 220 starts the generation 430 of instance trees by selecting a node to process 601. The mapper 220 first determines 602 whether the node it is considering is a leaf node. If the node is determined 602 to be a leaf node, the object array is initialized 604 with generated objects.

Once the object array is initialized 604 with objects generated by enumeration, the mapper 220 iterates 606-610 over all the objects in the array. For each such existing object, all objects that can be “inferred” from the existing object are added 610 to the object array. “Inference” is the only other way in which instance objects are generated, as described above. Once it is determined 606 that there are no more objects in the array, the object array is returned.

Referring back to the determination 602 of whether the node being processed is a leaf node, if the node is not a leaf node, the object array is initialized 614 to empty. The mapper 220 then determines 616 whether all the children of the node have been processed. If all the children of the node have not been processed, the next child node is selected 618 and processed 620. The maps of the child node are copied 622 to the object array, and the root of each copied object is set 624 to the child node.

If all the children of the node have been processed, then the attachment of attributes to objects is performed 626-648. Each object of the array is selected in turn as the object to which to attach attributes. This object is denoted as obj and is indexed by the variable i. Each object of the array is then selected in turn using the index j, initialized 630 to zero. The object indexed by j is examined 640 and is henceforth referred to as obj1. The goal of steps 640-648 is to determine whether obj1 can be attached as an attribute of obj, and to perform the attachment if it is possible. First, obj is examined 642 to see if it has as an attribute an object whose name is the name of obj1. If this is true, then the second test is performed 644: whether the handle of obj1 modifies the handle of obj. If this is true, then obj1 is attached 646 as an attribute of obj. Following this, or if either of the tests 642, 644 failed, the next item in the array is selected 648 as obj1.

Once the attributes have been attached to obj, the final step is the reassignment of obj's handle, steps 634-636. The handle of obj is set to obj itself if obj has been inferred; if not, the handle of obj is left undisturbed.
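By way of illustration only, the attachment of attributes and the reassignment of handles described above may be sketched as follows. The class InstanceObject, its fields, and the callback modifies (standing in for the test of whether one handle modifies another) are hypothetical names chosen for this sketch. Skipping the case in which obj1 is obj itself is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so an object can be its own handle
class InstanceObject:
    name: str
    declared: frozenset = frozenset()  # attribute names the model allows on this object
    handle: object = None              # token or object that anchors this object
    inferred: bool = False
    attributes: list = field(default_factory=list)


def attach_attributes(objects, modifies):
    """Attach each eligible obj1 to each obj, then reassign inferred handles."""
    for obj in objects:                       # outer loop, index i
        for obj1 in objects:                  # inner loop, index j
            if obj1 is obj:
                continue                      # assumption: no self-attachment
            has_slot = obj1.name in obj.declared           # test 642
            if has_slot and modifies(obj1.handle, obj.handle):  # test 644
                obj.attributes.append(obj1)   # step 646
        if obj.inferred:                      # steps 634-636
            obj.handle = obj
    return objects


chair = InstanceObject("Chair", frozenset({"Color", "Material"}), handle="chair")
black = InstanceObject("Color", handle="black")
vinyl = InstanceObject("Material", handle="vinyl")
MODIFIES = {("black", "chair"), ("vinyl", "chair")}  # toy modifier relation
attach_attributes([chair, black, vinyl], lambda h1, h2: (h1, h2) in MODIFIES)
attached = [a.name for a in chair.attributes]
```

Here both tests succeed for “black” and “vinyl” with respect to “chair”, so both objects are attached as attributes of the chair object.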

b) Pruning of Instance Trees

In one embodiment, once the instance trees are generated 430, pruning 440 is performed by the mapper 220 to discard invalid/incomplete instance trees. In one embodiment, for each map, a list of the tokens mapped into the instance tree is recorded; any instance tree for the sentence which does not map all the verbs and nouns is discarded.

An algorithm employed for pruning in one embodiment of the present invention is demonstrated in the flowchart in FIG. 7. Pruning starts 701 at the root of an instance tree. An array is designated 702 as the array of objects (i.e., components of the instance tree) associated with the root of the parse DAG. The content engine determines 704 whether there are any more objects in the array. As long as there are more objects remaining in the array, obj is assigned 706 the next object in the array. The content engine then determines 708 whether obj covers all nouns and verbs in the sentence. If not, the object is deleted 710 from the array. If obj does cover all nouns and verbs in the sentence, the content engine determines 712 whether the MIN and MAX attributes of the object are satisfied. If they are not satisfied, the object is deleted 710 from the array. If these attributes are satisfied, the content engine loops back to determine 704 whether there are any more objects left in the array. When such determinations have been made for all the objects in the array, the array is returned 714. Thus, only those instance trees that account for all the verbs and nouns of the given sentence, and which satisfy the MIN and MAX attributes, are retained.
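By way of illustration only, the pruning loop of FIG. 7 may be sketched as follows. Each candidate object is represented here as a dictionary with a set of covered tokens; that representation, the function name prune, and the predicate min_max_satisfied are hypothetical. The MIN and MAX test is left to a caller-supplied predicate, since the sketch makes no assumption about how those attributes are evaluated.

```python
def prune(instance_objects, nouns_and_verbs, min_max_satisfied):
    """Keep only objects that cover every noun and verb and pass MIN/MAX."""
    kept = []
    for obj in instance_objects:                  # steps 704-706
        if not nouns_and_verbs <= obj["tokens"]:  # step 708: coverage test
            continue                              # step 710: discard
        if not min_max_satisfied(obj):            # step 712: MIN/MAX test
            continue                              # step 710: discard
        kept.append(obj)
    return kept                                   # step 714


# Toy candidates for the text "black vinyl chair" (hypothetical data).
candidates = [
    {"id": "a", "tokens": {"black", "vinyl", "chair"}, "min_max_ok": True},
    {"id": "b", "tokens": {"black", "chair"}, "min_max_ok": True},
    {"id": "c", "tokens": {"black", "vinyl", "chair"}, "min_max_ok": False},
]
survivors = prune(candidates, {"vinyl", "chair"}, lambda o: o["min_max_ok"])
surviving_ids = [o["id"] for o in survivors]
```

Candidate “b” is discarded because it leaves “vinyl” unmapped, and candidate “c” is discarded by the MIN/MAX predicate, so only candidate “a” is retained.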

In another embodiment, a different algorithm may be used to discard instance trees. In still another embodiment, the step of pruning 440 need not be performed at all.

c) Choosing the Best Map

Finally, the instance tree which reflects the best map within the specified domain is chosen 450. FIG. 8 illustrates how the best map is chosen 450 in one embodiment of the present invention. One skilled in the art will note that the “best” map can be chosen 450 in several other ways.

In the embodiment illustrated in FIG. 8, a cost function is used to impose a partial order on maps of a sentence. The maps of the sentence which are maximal under this partial order are chosen to be the best maps of the sentence, and returned as the result(s) of the mapping procedure.

The cost function in FIG. 8 compares two maps (map A and map B), and returns the superior map. It consists of a set of eight comparisons 810-880, run in order. The kth comparison in the sequence is used only if the preceding k−1 comparisons have resulted in ties; thus, it is a hierarchy of tiebreakers. These are, in order:

810: If the number of tokens covered by the two maps is not identical, the superior map is the map covering the most tokens. The reasoning here is straightforward: a better map covers more tokens.

820: If #1 does not indicate the better map, choose the map whose topmost expression (maps joined by the words “and” or “or”, or the punctuation symbol “,”) is furthest from the root of the map. The reasoning here is that a conjunction can bind two phrases of arbitrary size. Consider, for example, the phrase “red feather and gold sheath pen”. This phrase is ambiguous: it could refer either to two objects (a red feather and a gold sheath pen) or to a single object (a pen with a red feather and a gold sheath). The two maps would be distinct: the first, two-object map has its expression at the root; the second has its expression one level down, joining attributes of a single object. This rule resolves in favor of binding phrases at the lower of the possible levels, i.e., conjoining the smaller possible units. In this example, the rule prefers the second map (pen with a red feather and a gold sheath) over the first. When a map has no expressions, the distance of an expression from the root is taken to be infinite.

830: If the maps are equal under criteria #1 and #2, choose the map with the least distance between the tokens. In an n-token text fragment, tokens are assigned indices. The leftmost token is assigned index 0, and the token to the immediate right of the token with index i is assigned index i+1. This rule chooses the map with the smallest difference in index between the leftmost and rightmost tokens covered by the map. So, for example, given the phrase “red felt pen tip”, with indices red=0, felt=1, pen=2, tip=3, and map A covering “red felt tip” and map B covering “felt pen tip”, map B would be chosen as it has the least distance between its covered tokens (3−1=2 compared to 3−0=3). The reasoning here is that compact maps are preferred over disjoined maps.

840: If the maps are equal under criteria #1-#3, choose the map with the fewer objects created by enumerations.

850: If the maps are equal under criteria #1-#4, choose the map with the fewer unused primitives, i.e., words and phrases in the text fragment unused by the relevant map.

860: If the maps are equal under criteria #1-#5, choose the map with the fewer objects created by database lookup.

870: If the maps are equal under criteria #1-#6, choose the map with the fewer NML objects.

880: If the maps are equal under criteria #1-#7, choose the map with the fewer inferred objects.

If the maps are equal under all eight criteria, then they are incomparable (and thus equal) under the partial order, and are regarded as equally valid maps.
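By way of illustration only, the hierarchy of tiebreakers 810-880 may be sketched as a lexicographic comparison of one tuple per map. The class MapStats, its field names, and the functions cost_key and best_maps are hypothetical; the sketch assumes the eight measurements have already been computed for each map. Criteria for which a larger value is better (tokens covered, expression distance from the root) are negated so that a smaller tuple always denotes the superior map.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapStats:
    label: str
    tokens_covered: int                 # 810: more is better
    expression_depth: float = math.inf  # 820: farther from root is better
    token_span: int = 0                 # 830: rightmost index minus leftmost index
    enumerated: int = 0                 # 840: objects created by enumerations
    unused_primitives: int = 0          # 850
    looked_up: int = 0                  # 860: objects created by database lookup
    nml_objects: int = 0                # 870
    inferred: int = 0                   # 880


def cost_key(m):
    """Tuple ordered so that a smaller key means a better map."""
    return (-m.tokens_covered, -m.expression_depth, m.token_span,
            m.enumerated, m.unused_primitives, m.looked_up,
            m.nml_objects, m.inferred)


def best_maps(maps):
    """Return every map that ties for the best key (possibly several)."""
    if not maps:
        return []
    best = min(cost_key(m) for m in maps)
    return [m for m in maps if cost_key(m) == best]


# The "red felt pen tip" example: equal coverage, no expressions, spans 3 and 2.
map_a = MapStats("red felt tip", tokens_covered=3, token_span=3)
map_b = MapStats("felt pen tip", tokens_covered=3, token_span=2)
winners = [m.label for m in best_maps([map_a, map_b])]
```

Because tuples are compared element by element, a later criterion is consulted only when all earlier criteria tie, which matches the tiebreaker hierarchy; maps with identical tuples are all returned as equally valid.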

The different criteria of the cost function illustrated in FIG. 8 break into three distinct groups. The first group, comprising rules 1, 3, and 5, is based on the structure of the sentence. Maps which use the most tokens, contained in a compact group, are preferred over maps which use fewer tokens spread further over the text segment. Rule 2, as mentioned above, resolves ambiguities with respect to expression phrases in favor of the tightest possible binding. Rules 4 and 6-8 comprise another major group, and act together to prefer maps which have fewer objects. Together, they can be read as preferring maps with less created structure over maps with more created structure.

3. DML Generation

As discussed above, the data structure produced by the mapper 220 is an instance of the domain described in the NML document. In one embodiment, this data structure is then used to generate DML. DML generation is done in a depth-first fashion over the NML instance tree. FIG. 9 is a flowchart that illustrates the generation 460 of DML.

The output of the mapper 220, described above, is a tree of NML object instances with enumerations in the leaves (actually, in general, it is a collection of such trees, since some maps can “tie” for the best map). Each tree is first pruned by removal of nodes that have no peers and whose descendants have no peers: such nodes cannot generate DML_ELEMENTS, DML_ATTRIBUTES, or DML_VALUES. In one embodiment, at each node in the resulting pruned NML instance tree, the following algorithm is performed:

proc generateDML(NMLInstanceNode node) {
  set savedElement = current DML_ELEMENT
  set savedAttribute = current DML_ATTRIBUTE
  if (node is a trigger for a DML_CALL) {
      close & output all open DML_ELEMENTS
      set the current DML_ELEMENT to the DML_CALL
  } else if (node has a DML_ELEMENT) {
      set newElement = new DML_ELEMENT with name in declaration
      attach newElement to current DML_ELEMENT
      set current DML_ELEMENT to newElement
  } else if (node has a DML_ATTRIBUTE) {
      set newAttribute = named attribute in declaration
      set current DML_ATTRIBUTE = newAttribute
  }
  if (node is a leaf) {
      set the value of the current DML_ATTRIBUTE to the identifying token
  } else if (node has a DML_VALUE) {
      set the value of the current DML_ATTRIBUTE to the named value
  }
  foreach child of node {
      generateDML(child)
  }
  close any DML_ELEMENT or DML_ATTRIBUTE created by this node
  set current DML_ELEMENT = savedElement
  set current DML_ATTRIBUTE = savedAttribute
  return;
}

The generateDML process is called on each root node of each tree, in turn. Once it has completed on a root node, any open DML elements are closed and output.
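By way of illustration only, the depth-first generation of DML may be sketched in simplified form as follows. The classes Node and Element, their fields, and the attribute names “color” and “material” are hypothetical. The sketch models a DML document as nested elements carrying attribute values, which is an assumption, and it omits the DML_CALL branch. The saving and restoring of the current element and attribute is achieved here by passing them as parameters, so each recursive call sees its own copies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:                               # hypothetical NML instance node
    token: Optional[str] = None           # identifying token of a leaf
    dml_element: Optional[str] = None     # element name declared on this node
    dml_attribute: Optional[str] = None   # attribute name declared on this node
    dml_value: Optional[str] = None       # fixed value declared on this node
    children: list = field(default_factory=list)


@dataclass
class Element:                            # hypothetical DML output element
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def generate_dml(node, element, attribute=None):
    """Walk the instance tree depth-first, filling `element` with output."""
    if node.dml_element:
        new_element = Element(node.dml_element)
        element.children.append(new_element)   # attach to the current element
        element = new_element                  # becomes current for this subtree
    elif node.dml_attribute:
        attribute = node.dml_attribute         # becomes current for this subtree
    if attribute is not None:
        if not node.children:                  # leaf: use the identifying token
            element.attributes[attribute] = node.token
        elif node.dml_value is not None:       # otherwise use the named value
            element.attributes[attribute] = node.dml_value
    for child in node.children:
        generate_dml(child, element, attribute)


tree = Node(dml_element="Chair", children=[
    Node(dml_attribute="color", children=[Node(token="black")]),
    Node(dml_attribute="material", children=[Node(token="vinyl")]),
])
document = Element("DML")
generate_dml(tree, document)
chair_element = document.children[0]
```

In this toy run the single “Chair” element receives the attribute values taken from the two leaf tokens.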

4. DML Used To Populate DBMSs, Retrieve Data, and Invoke Programs

Once the DML has been generated, it can be used in a variety of different ways, including populating a database system, retrieving data from a database system or other data store, or invoking a program using the parameters stored in the DML document. These various applications are illustrated in FIGS. 10-12. In FIG. 10, a description of a “black vinyl chair” 1030 is converted into a structured description 1060. The description is input into the Content Engine 1020, which produces a DML Document 1040. A DML Processing System 1050 then generates the structured description 1060. It will be obvious to one skilled in the art that the tabular form 1060 is suitable for insertion into any database management system, including but not limited to a relational database management system.

In FIG. 11, a natural language request for a “black vinyl chair” 1130 is converted into a structured query 1160. The request is fed into the Content Engine 1120, which produces a DML Document 1140. A DML Processing System 1150 then generates the structured query 1160. The structured query here is shown in the database query language SQL. It will be obvious to one skilled in the art that the DML Processing System 1150 could generate a query in any of a number of database languages, and is not restricted to SQL.
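By way of illustration only, one possible DML processing step for the query case may be sketched as follows. The sketch assumes that the DML document has already been reduced to a flat mapping of attribute names to values; the function build_query, the table name “products”, and the column names are hypothetical and are not taken from FIG. 11. Values are passed as query parameters rather than spliced into the SQL text.

```python
def build_query(table, attributes, allowed_columns):
    """Build a parameterized SELECT from a flat attribute mapping.

    `table` is assumed to be a trusted constant; column names are checked
    against `allowed_columns` because identifiers cannot be parameterized.
    """
    unknown = set(attributes) - set(allowed_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    columns = sorted(attributes)                      # stable clause order
    where = " AND ".join(f"{column} = ?" for column in columns)
    sql = f"SELECT * FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, [attributes[column] for column in columns]


# Hypothetical attributes extracted from a request for a "black vinyl chair".
sql, params = build_query(
    "products",
    {"color": "black", "material": "vinyl", "type": "chair"},
    allowed_columns={"color", "material", "type"},
)
```

The resulting statement and parameter list can be handed to any database interface that accepts positional placeholders of this style.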

It is noted that here the NML model 1010 and the NML model 1110 are identical: the same model is used for both content creation and content query. This illustrates the flexibility and robustness of the present invention.

In FIG. 12, a natural language request for a stock chart 1230 is converted into a program invocation 1260. The request is fed into the Content Engine 1220, which produces a DML Document 1240. A DML Processing System 1250 then generates the program invocation 1260. The program invocation here is shown as an HTTP CGI request. It will be obvious to one skilled in the art that the DML Processing System 1250 could generate a program invocation in any scripting, web, or API environment, and is not restricted to HTTP requests.
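By way of illustration only, the generation of an HTTP request from DML parameters may be sketched as follows. The function build_invocation, the example URL, and the parameter names “symbol” and “period” are hypothetical and are not taken from FIG. 12; the sketch assumes the DML document has already been reduced to a flat mapping of parameter names to values.

```python
from urllib.parse import urlencode


def build_invocation(base_url, parameters):
    """Return a GET request URL carrying the parameters as a query string."""
    # Sorting gives a stable parameter order; urlencode escapes the values.
    query = urlencode(sorted(parameters.items()))
    return f"{base_url}?{query}" if query else base_url


url = build_invocation(
    "http://example.com/cgi-bin/chart",     # hypothetical CGI endpoint
    {"symbol": "XYZ", "period": "1y"},      # hypothetical DML parameters
)
```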

Construction of a DML processing system such as 1050, 1150, or 1250 is site- and application-specific. The major task is traversing the structured DML document 1040, 1140, or 1240, and converting that information into the form required by the application or site. The means of constructing such a system is evident to those familiar with the art.

As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. For example, note that the various algorithms are illustrative, and variations are easily implemented. For example, a different cost function could be used to compute the best map, or the pruning step may be left out altogether. Likewise, the particular capitalization or naming of the modules, protocols, features, attributes, data structures, or any other aspect is not mandatory or significant, and the mechanisms that implement the invention or its features may have different names or formats. Further, functionality which is shown or described as being provided by a single module may be provided instead by multiple different modules; likewise functionality provided by multiple modules may be provided instead by fewer modules or a single module. Further, while a software based embodiment has been described, the functionality of the invention may be embodied in whole or in part in various hardware elements, such as application specific integrated circuits (ASICs) or the like. The particular examples of NML and DML are illustrative, and not limiting. Indeed, given the flexibility of the invention, it is understood that the NML and DML are not limited to the example domains and applications discussed, but may be applied in numerous other domains and embodiments. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

1-22. (canceled)
23. A system, comprising: a processor configured to: tokenize a plain text description; create parse trees from the tokenized plain text description based on grammar from a grammar storage area; generate an instance tree from each parse tree based upon an application domain specific natural markup language provided by a natural markup language model module; discard each invalid or incomplete instance tree; choose an instance tree from remaining instance trees representing a best map based upon a cost function; and process the best map with a domain markup language generator to generate a structured data representation; and a memory coupled to the processor and configured to provide the processor with instructions.
24. The system of claim 23 wherein the processor is further configured to use the structured data representation to populate a database.
25. The system of claim 23 wherein the processor is further configured to use the structured data representation to query a database.
26. The system of claim 23 wherein the processor is further configured to use the structured data representation to invoke an application.
27. The system of claim 23, wherein the cost function comprises choosing maps with less structure over maps with more created structure.
28. The system of claim 23, wherein the cost function comprises: choosing maps that use the most tokens contained in compact groups over maps using fewer tokens spread further over text segments; choosing maps with the tightest possible bindings; and choosing maps that have fewer objects.
29. The system of claim 23, wherein the cost function comprises: (a) choosing a map with the most tokens; (b) if maps are equal under (a), then choosing a map having a topmost expression farthest from a root of the map; (c) if maps are equal under (a) and (b), then choosing a map with a least distance between tokens; (d) if maps are equal under (a) through (c), then choosing a map with fewer objects created by enumerations; (e) if maps are equal under (a) through (d), then choosing a map with fewer unused primitives; (f) if maps are equal under (a) through (e), then choosing a map with fewer objects created by database lookup; (g) if maps are equal under (a) through (f), then choosing a map with fewer natural markup language objects; (h) if maps are equal under (a) through (g), then choosing a map with fewer inferred objects.
30. The system of claim 29, wherein the cost function further comprises: (i) if maps are equal under (a) through (h), then regarding all maps as equally valid.
31. The system of claim 23, wherein all possible parse trees from the tokenized plain text are created.
32. The system of claim 23, wherein the processor is further configured to represent all of the parse trees in a single directed acyclic graph.
33. The system of claim 23, wherein the grammar from the grammar storage area is context free.
34. A computer program product embodied in a computer readable medium and comprising computer instructions for: tokenizing a plain text description; creating parse trees from the tokenized plain text description based on grammar from a grammar storage area; generating an instance tree from each parse tree based upon an application domain specific natural markup language provided by a natural markup language model module; discarding each invalid or incomplete instance tree; choosing an instance tree from remaining instance trees representing a best map based upon a cost function; and processing the best map with a domain markup language generator to generate a structured data representation.
35. The computer program product of claim 34 further comprising computer instructions for using the structured data representation to populate a database.
36. The computer program product of claim 34 further comprising computer instructions for using the structured data representation to query a database.
37. The computer program product of claim 34 further comprising computer instructions for using the structured data representation to invoke an application.
38. The computer program product of claim 34, wherein the cost function comprises choosing maps with less structure over maps with more created structure.
39. The computer program product of claim 34, wherein the cost function comprises: choosing maps that use the most tokens contained in compact groups over maps using fewer tokens spread further over text segments; choosing maps with the tightest possible bindings; and choosing maps that have fewer objects.
40. The computer program product of claim 34, wherein the cost function comprises: (a) choosing a map with the most tokens; (b) if maps are equal under (a), then choosing a map having a topmost expression farthest from a root of the map; (c) if maps are equal under (a) and (b), then choosing a map with a least distance between tokens; (d) if maps are equal under (a) through (c), then choosing a map with fewer objects created by enumerations; (e) if maps are equal under (a) through (d), then choosing a map with fewer unused primitives; (f) if maps are equal under (a) through (e), then choosing a map with fewer objects created by database lookup; (g) if maps are equal under (a) through (f), then choosing a map with fewer natural markup language objects; (h) if maps are equal under (a) through (g), then choosing a map with fewer inferred objects.
41. The computer program product of claim 40, wherein the cost function further comprises: (i) if maps are equal under (a) through (h), then regarding all maps as equally valid.
42. The computer program product of claim 34, wherein the grammar from the grammar storage area is context free.